IMDb is the world’s most popular online database of ratings and reviews for movies, television series, and more. As consumers, we often want to know what other people think of a movie or show before we watch it, and IMDb is the go-to destination for that. The aim of this project is to see which factors help predict the genre of a Netflix movie or show, and whether those factors differ between the two types. Both datasets, one for movies and one for shows, come from Kaggle.
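library(dplyr)
library(ggplot2)
library(tidymodels)
library(tidyverse)
library(naniar) # summarizing and plotting missing values
library(patchwork) # plotting graphs side by side
library(corrplot) # correlation plot
library(ggthemes) # extra ggplot2 themes
library(kableExtra) # styled, scrollable tables
library(glmnet) # regularized regression engine
library(kknn) # for knn
library(ranger) # for random forest
library(xgboost) # for boosted trees
library(yardstick) # model performance metrics
library(vip) # variable importance plots
tidymodels_prefer() # resolve function name conflicts in favor of tidymodels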
movies1 <- read.csv("/Users/alainaliu/Downloads/PSTAT 131/Netflix Project/Best Movies Netflix.csv")
shows1 <- read.csv("/Users/alainaliu/Downloads/PSTAT 131/Netflix Project/Best Shows Netflix.csv")
movies1[,-1] %>% kable() %>%
kable_styling("striped", full_width = FALSE) %>%
scroll_box(height = "420px")
| TITLE | RELEASE_YEAR | SCORE | NUMBER_OF_VOTES | DURATION | MAIN_GENRE | MAIN_PRODUCTION |
|---|---|---|---|---|---|---|
| David Attenborough: A Life on Our Planet | 2020 | 9.0 | 31180 | 83 | documentary | GB |
| Inception | 2010 | 8.8 | 2268288 | 148 | scifi | GB |
| Forrest Gump | 1994 | 8.8 | 1994599 | 142 | drama | US |
| Anbe Sivam | 2003 | 8.7 | 20595 | 160 | comedy | IN |
| Bo Burnham: Inside | 2021 | 8.7 | 44074 | 87 | comedy | US |
| Saving Private Ryan | 1998 | 8.6 | 1346020 | 169 | drama | US |
| Django Unchained | 2012 | 8.4 | 1472668 | 165 | western | US |
| Dangal | 2016 | 8.4 | 180247 | 161 | action | IN |
| Bo Burnham: Make Happy | 2016 | 8.4 | 14356 | 60 | comedy | US |
| Louis C.K.: Hilarious | 2010 | 8.4 | 11973 | 84 | comedy | US |
| Dave Chappelle: Sticks & Stones | 2019 | 8.4 | 25687 | 65 | comedy | US |
| 3 Idiots | 2009 | 8.4 | 385782 | 170 | comedy | IN |
| Black Friday | 2004 | 8.4 | 20611 | 143 | crime | IN |
| Super Deluxe | 2019 | 8.4 | 13680 | 176 | thriller | IN |
| Winter on Fire: Ukraine’s Fight for Freedom | 2015 | 8.3 | 17710 | 98 | documentary | UA |
| Once Upon a Time in America | 1984 | 8.3 | 342335 | 229 | drama | US |
| Taxi Driver | 1976 | 8.3 | 795222 | 113 | crime | US |
| Like Stars on Earth | 2007 | 8.3 | 188234 | 165 | drama | IN |
| Bo Burnham: What. | 2013 | 8.3 | 11488 | 60 | comedy | US |
| Full Metal Jacket | 1987 | 8.3 | 723306 | 116 | drama | GB |
| Warrior | 2011 | 8.2 | 463276 | 140 | drama | US |
| Drishyam | 2015 | 8.2 | 79075 | 163 | thriller | IN |
| Queen | 2014 | 8.2 | 64805 | 146 | drama | IN |
| Paan Singh Tomar | 2012 | 8.2 | 35888 | 135 | drama | IN |
| Cowspiracy: The Sustainability Secret | 2014 | 8.2 | 24845 | 90 | documentary | US |
| Virunga | 2014 | 8.2 | 11403 | 90 | war | CD |
| PK | 2014 | 8.2 | 178012 | 153 | comedy | IN |
| Bāhubali 2: The Conclusion | 2017 | 8.2 | 91560 | 168 | fantasy | IN |
| Monty Python and the Holy Grail | 1975 | 8.2 | 530877 | 91 | comedy | GB |
| Article 15 | 2019 | 8.2 | 32336 | 130 | crime | IN |
| Miracle in Cell No. 7 | 2019 | 8.2 | 46939 | 132 | drama | TR |
| 13th | 2016 | 8.2 | 34914 | 100 | documentary | US |
| Andhadhun | 2018 | 8.2 | 88359 | 139 | thriller | IN |
| Bill Burr: Paper Tiger | 2019 | 8.1 | 10649 | 67 | comedy | GB |
| Udaan | 2010 | 8.1 | 44556 | 138 | drama | IN |
| How to Train Your Dragon | 2010 | 8.1 | 719717 | 98 | fantasy | US |
| Klaus | 2019 | 8.1 | 141480 | 97 | comedy | ES |
| Swades | 2004 | 8.1 | 89085 | 189 | drama | IN |
| Minnal Murali | 2021 | 8.1 | 24681 | 158 | action | IN |
| Rang De Basanti | 2006 | 8.1 | 118092 | 157 | comedy | IN |
| Seaspiracy | 2021 | 8.1 | 29604 | 80 | documentary | US |
| Rush | 2013 | 8.1 | 465254 | 123 | drama | US |
| Hannah Gadsby: Nanette | 2018 | 8.1 | 12035 | 69 | comedy | AU |
| Barfi! | 2012 | 8.1 | 80643 | 151 | drama | IN |
| Haider | 2014 | 8.1 | 54001 | 150 | drama | IN |
| Zindagi Na Milegi Dobara | 2011 | 8.1 | 75801 | 166 | comedy | IN |
| A Silent Voice: The Movie | 2016 | 8.1 | 75132 | 130 | romance | JP |
| OMG: Oh My God! | 2012 | 8.1 | 57449 | 125 | fantasy | IN |
| Talvar | 2015 | 8.1 | 34659 | 132 | thriller | IN |
| Into the Wild | 2007 | 8.1 | 611379 | 140 | drama | US |
| Lagaan: Once Upon a Time in India | 2001 | 8.1 | 111053 | 224 | drama | IN |
| My Octopus Teacher | 2020 | 8.1 | 51232 | 84 | documentary | ZA |
| Dil Chahta Hai | 2001 | 8.1 | 71167 | 183 | drama | IN |
| Mersal | 2017 | 8.1 | 32573 | 169 | thriller | IN |
| The Legend of Bhagat Singh | 2002 | 8.1 | 16225 | 155 | drama | IN |
| Stand by Me | 1986 | 8.1 | 392790 | 89 | drama | US |
| The Exorcist | 1973 | 8.1 | 391942 | 133 | horror | US |
| Bombay | 1995 | 8.1 | 12512 | 141 | romance | IN |
| Dasvi | 2022 | 8.0 | 13140 | 125 | drama | IN |
| G.O.R.A. | 2004 | 8.0 | 61797 | 127 | scifi | TR |
| Blood Diamond | 2006 | 8.0 | 536858 | 143 | thriller | US |
| Vizontele | 2001 | 8.0 | 36291 | 110 | comedy | TR |
| Ip Man | 2008 | 8.0 | 221095 | 108 | drama | HK |
| Her | 2013 | 8.0 | 586679 | 126 | drama | US |
| The Bourne Ultimatum | 2007 | 8.0 | 627009 | 115 | thriller | DE |
| Casino Royale | 2006 | 8.0 | 644336 | 139 | thriller | GB |
| Special 26 | 2013 | 8.0 | 55489 | 144 | thriller | IN |
| Neon Genesis Evangelion: The End of Evangelion | 1997 | 8.0 | 51938 | 87 | scifi | JP |
| Bāhubali: The Beginning | 2015 | 8.0 | 117333 | 159 | drama | IN |
| Ankhon Dekhi | 2014 | 8.0 | 11330 | 104 | drama | IN |
| Big Fish | 2003 | 8.0 | 435503 | 125 | drama | US |
| Silenced | 2011 | 8.0 | 15889 | 125 | drama | KR |
| Life of Brian | 1979 | 8.0 | 392419 | 94 | comedy | GB |
| The Invisible Guest | 2017 | 8.0 | 170351 | 107 | thriller | ES |
| Dave Chappelle: The Closer | 2021 | 8.0 | 24903 | 72 | comedy | US |
| Blade Runner 2049 | 2017 | 8.0 | 539864 | 164 | scifi | CA |
| The Imitation Game | 2014 | 8.0 | 748654 | 113 | thriller | US |
| Jab We Met | 2007 | 7.9 | 51945 | 138 | comedy | IN |
| Free to Play | 2014 | 7.9 | 13308 | 75 | documentary | UA |
| Monty Python Live at the Hollywood Bowl | 1982 | 7.9 | 15186 | 77 | comedy | GB |
| Dev.D | 2009 | 7.9 | 30389 | 144 | drama | IN |
| Marriage Story | 2019 | 7.9 | 290643 | 136 | drama | GB |
| Icarus | 2017 | 7.9 | 48672 | 121 | documentary | US |
| I Am Not Your Negro | 2017 | 7.9 | 21632 | 93 | documentary | BE |
| Ricky Gervais: Humanity | 2018 | 7.9 | 18523 | 79 | comedy | GB |
| Secret Superstar | 2017 | 7.9 | 24046 | 150 | drama | IN |
| Kal Ho Naa Ho | 2003 | 7.9 | 68028 | 186 | drama | IN |
| Pad Man | 2018 | 7.9 | 25269 | 140 | comedy | IN |
| Shyam Singha Roy | 2021 | 7.9 | 10903 | 157 | drama | IN |
| How to Train Your Dragon 2 | 2014 | 7.8 | 327565 | 102 | fantasy | US |
| The Irishman | 2019 | 7.8 | 371209 | 209 | drama | US |
| Gangaajal | 2003 | 7.8 | 17029 | 157 | drama | IN |
| The Girl with the Dragon Tattoo | 2011 | 7.8 | 454917 | 158 | crime | NO |
| Lakshya | 2004 | 7.8 | 23076 | 186 | drama | IN |
| The Social Network | 2010 | 7.8 | 681286 | 121 | drama | US |
| Dunkirk | 2017 | 7.8 | 619645 | 107 | drama | US |
| Kai Po Che! | 2013 | 7.8 | 36512 | 126 | drama | IN |
| The Gentlemen | 2019 | 7.8 | 314049 | 113 | comedy | US |
| Marco Polo: One Hundred Eyes | 2015 | 7.8 | 10742 | 28 | action | US |
| The Hateful Eight | 2015 | 7.8 | 570138 | 188 | western | US |
| Hunt for the Wilderpeople | 2016 | 7.8 | 125720 | 101 | comedy | NZ |
| The Last Samurai | 2003 | 7.8 | 429097 | 154 | drama | NZ |
| Gattaca | 1997 | 7.8 | 298168 | 106 | thriller | US |
| The Butterfly’s Dream | 2013 | 7.8 | 21882 | 138 | drama | TR |
| 14 Peaks: Nothing Is Impossible | 2021 | 7.8 | 22858 | 101 | documentary | US |
| Nightcrawler | 2014 | 7.8 | 523686 | 118 | crime | US |
| Udta Punjab | 2016 | 7.8 | 29819 | 148 | crime | IN |
| My Fair Lady | 1964 | 7.8 | 94121 | 170 | drama | US |
| System Crasher | 2019 | 7.8 | 12699 | 118 | drama | DE |
| The Game Changers | 2019 | 7.8 | 19708 | 88 | documentary | US |
| Awakenings | 1990 | 7.8 | 137549 | 120 | drama | US |
| Badla | 2019 | 7.8 | 27130 | 120 | thriller | IN |
| Madras Cafe | 2013 | 7.7 | 24319 | 130 | thriller | IN |
| Beasts of No Nation | 2015 | 7.7 | 80129 | 137 | war | US |
| Bonnie and Clyde | 1967 | 7.7 | 111189 | 110 | drama | US |
| Roma | 2018 | 7.7 | 153508 | 135 | drama | MX |
| Kapoor & Sons | 2016 | 7.7 | 25792 | 132 | romance | IN |
| Mucize | 2015 | 7.7 | 12395 | 136 | drama | TR |
| Wind River | 2017 | 7.7 | 240408 | 106 | thriller | FR |
| Oye Lucky! Lucky Oye! | 2008 | 7.7 | 17411 | 126 | comedy | IN |
| Dirty Harry | 1971 | 7.7 | 153463 | 102 | thriller | US |
| Jim & Andy: The Great Beyond - Featuring a Very Special, Contractually Obligated Mention of Tony Clifton | 2017 | 7.7 | 25593 | 94 | comedy | US |
| Berserk: The Golden Age Arc II - The Battle for Doldrey | 2012 | 7.7 | 10257 | 80 | fantasy | JP |
| Silver Linings Playbook | 2012 | 7.7 | 697481 | 122 | drama | US |
| Sanju | 2018 | 7.7 | 52227 | 161 | drama | IN |
| Argo | 2012 | 7.7 | 600392 | 120 | drama | US |
| Guru | 2007 | 7.7 | 23541 | 166 | romance | IN |
| Road to Perdition | 2002 | 7.7 | 263212 | 117 | thriller | US |
| The Trial of the Chicago 7 | 2020 | 7.7 | 170728 | 130 | drama | US |
| When Harry Met Sally… | 1989 | 7.7 | 212913 | 96 | romance | US |
| Rock On!! | 2008 | 7.7 | 21963 | 144 | drama | IN |
| In the Family | 2017 | 7.7 | 23297 | 124 | comedy | TR |
| Pyaar Ka Punchnama | 2011 | 7.7 | 21204 | 149 | romance | IN |
| Midnight in Paris | 2011 | 7.7 | 413541 | 94 | fantasy | US |
| Donnie Brasco | 1997 | 7.7 | 300073 | 127 | thriller | US |
| The Blind Side | 2009 | 7.6 | 323939 | 129 | drama | US |
| Sherlock Holmes | 2009 | 7.6 | 620154 | 129 | crime | GB |
| Stardust | 2007 | 7.6 | 269043 | 122 | fantasy | US |
| Kuch Kuch Hota Hai | 1998 | 7.6 | 51640 | 185 | drama | IN |
| Parmanu: The Story of Pokhran | 2018 | 7.6 | 23771 | 129 | drama | IN |
| Doctor | 2021 | 7.6 | 14590 | 150 | thriller | IN |
| What Happened, Miss Simone? | 2015 | 7.6 | 13703 | 101 | musical | US |
| Eddie Murphy Raw | 1987 | 7.6 | 19646 | 93 | comedy | US |
| Delhi Belly | 2011 | 7.6 | 29578 | 102 | comedy | IN |
| The Boy Who Harnessed the Wind | 2019 | 7.6 | 36805 | 113 | drama | MW |
| Kabhi Haan Kabhi Naa | 1994 | 7.6 | 18224 | 158 | comedy | IN |
| The Two Popes | 2019 | 7.6 | 120871 | 125 | drama | US |
| The Social Dilemma | 2020 | 7.6 | 79674 | 94 | drama | US |
| Bad Genius | 2017 | 7.6 | 20430 | 130 | thriller | TH |
| The Mitchells vs. the Machines | 2021 | 7.6 | 100787 | 113 | animation | US |
| Gifted Hands: The Ben Carson Story | 2009 | 7.6 | 10210 | 86 | drama | US |
| Tell Me Who I Am | 2019 | 7.6 | 14215 | 85 | thriller | GB |
| RBG | 2018 | 7.6 | 14037 | 98 | documentary | US |
| Highway | 2014 | 7.6 | 28370 | 133 | drama | IN |
| Athlete A | 2020 | 7.6 | 10544 | 104 | documentary | US |
| Lupin the Third: The Castle of Cagliostro | 1979 | 7.6 | 30277 | 100 | comedy | JP |
| Love Actually | 2003 | 7.6 | 474176 | 139 | drama | GB |
| Hell or High Water | 2016 | 7.6 | 224900 | 102 | western | US |
| Wake Up Sid | 2009 | 7.6 | 30818 | 138 | comedy | IN |
| Ludo | 2020 | 7.6 | 37528 | 150 | crime | IN |
| Stree | 2018 | 7.6 | 32814 | 128 | horror | IN |
| Aamir | 2008 | 7.6 | 11241 | 99 | thriller | IN |
| I Am Sam | 2001 | 7.6 | 149082 | 132 | drama | US |
| True Grit | 2010 | 7.6 | 333378 | 110 | western | US |
| The Distinguished Citizen | 2016 | 7.5 | 11495 | 118 | comedy | AR |
| Jodhaa Akbar | 2008 | 7.5 | 32188 | 213 | romance | IN |
| The Conjuring | 2013 | 7.5 | 491048 | 107 | thriller | US |
| Dhamaka | 2021 | 7.5 | 39620 | 104 | thriller | IN |
| Bareilly Ki Barfi | 2017 | 7.5 | 23011 | 123 | romance | IN |
| Les Misérables | 2012 | 7.5 | 325132 | 157 | drama | GB |
| Berserk: The Golden Age Arc I - The Egg of the King | 2012 | 7.5 | 12278 | 76 | fantasy | JP |
| Dil Se.. | 1998 | 7.5 | 28409 | 163 | drama | IN |
| Omar | 2013 | 7.5 | 14230 | 96 | thriller | PS |
| I Lost My Body | 2019 | 7.5 | 31531 | 81 | fantasy | FR |
| tick, tick… BOOM! | 2021 | 7.5 | 96418 | 121 | drama | US |
| Coming Soon | 2014 | 7.5 | 33714 | 134 | drama | TR |
| Nocturnal Animals | 2016 | 7.5 | 264884 | 115 | drama | US |
| A Monster Calls | 2016 | 7.5 | 86614 | 108 | fantasy | ES |
| On Body and Soul | 2017 | 7.5 | 27003 | 116 | fantasy | HU |
| Ala Vaikunthapurramuloo | 2020 | 7.5 | 14839 | 165 | drama | IN |
| The Devil’s Advocate | 1997 | 7.5 | 361422 | 144 | horror | DE |
| Sivaji: The Boss | 2007 | 7.5 | 19556 | 189 | drama | IN |
| Happy as Lazzaro | 2018 | 7.5 | 17716 | 125 | fantasy | IT |
| The Guns of Navarone | 1961 | 7.5 | 50150 | 158 | war | US |
| Who Am I | 2014 | 7.5 | 55044 | 105 | thriller | DE |
| White Christmas | 1954 | 7.5 | 42373 | 115 | romance | US |
| Ip Man 2 | 2010 | 7.5 | 103673 | 108 | drama | CN |
| 42 | 2013 | 7.5 | 93314 | 128 | drama | US |
| Menace II Society | 1993 | 7.5 | 57399 | 97 | drama | US |
| Hum Aapke Hain Koun..! | 1994 | 7.5 | 20986 | 206 | romance | IN |
| Blow | 2001 | 7.5 | 255099 | 124 | drama | US |
| Begin Again | 2013 | 7.4 | 154049 | 104 | comedy | US |
| Uncut Gems | 2019 | 7.4 | 261956 | 130 | drama | US |
| Sherlock Holmes: A Game of Shadows | 2011 | 7.4 | 446531 | 129 | crime | US |
| Rurouni Kenshin Part I: Origins | 2012 | 7.4 | 25793 | 134 | drama | JP |
| Forgotten | 2017 | 7.4 | 29804 | 109 | thriller | KR |
| Tamasha | 2015 | 7.4 | 26790 | 139 | drama | IN |
| Mudbound | 2017 | 7.4 | 47676 | 120 | drama | US |
| Lady Bird | 2017 | 7.4 | 277165 | 94 | drama | US |
| Molly’s Game | 2017 | 7.4 | 165817 | 140 | drama | CA |
| Raman Raghav 2.0 | 2016 | 7.4 | 14380 | 134 | thriller | IN |
| Once Upon a Time in Mumbaai | 2010 | 7.4 | 17494 | 132 | thriller | IN |
| Jaane Tu… Ya Jaane Na | 2008 | 7.4 | 26738 | 155 | drama | IN |
| Peepli Live | 2010 | 7.4 | 12265 | 104 | drama | IN |
| Even the Rain | 2010 | 7.4 | 13446 | 104 | drama | ES |
| Darkest Hour | 2017 | 7.4 | 193208 | 125 | drama | GB |
| The Bucket List | 2007 | 7.4 | 242733 | 97 | drama | US |
| American Factory | 2019 | 7.4 | 21415 | 110 | documentary | US |
| Miss Americana | 2020 | 7.4 | 19151 | 85 | documentary | US |
| The Hand of God | 2021 | 7.4 | 30235 | 130 | drama | IT |
| A Nightmare on Elm Street | 1984 | 7.4 | 230543 | 91 | horror | US |
| 83 | 2021 | 7.4 | 23781 | 163 | drama | IN |
| Meenakshi Sundareshwar | 2021 | 7.4 | 17141 | 141 | comedy | IN |
| Kurup | 2021 | 7.4 | 11582 | 155 | crime | IN |
| Crazy, Stupid, Love. | 2011 | 7.4 | 507878 | 118 | romance | US |
| Schumacher | 2021 | 7.4 | 21558 | 112 | sports | DE |
| Guzaarish | 2010 | 7.4 | 18466 | 126 | drama | IN |
| What the Health | 2017 | 7.4 | 28911 | 97 | documentary | US |
| Mirage | 2018 | 7.4 | 52657 | 129 | thriller | ES |
| Bully | 2011 | 7.4 | 10266 | 92 | drama | US |
| Phantom Thread | 2017 | 7.4 | 128600 | 130 | romance | US |
| Looper | 2012 | 7.4 | 566791 | 119 | thriller | US |
| Felon | 2008 | 7.4 | 78039 | 103 | crime | US |
| Life in a Metro | 2007 | 7.4 | 11934 | 124 | drama | IN |
| Kabhi Khushi Kabhie Gham | 2001 | 7.4 | 48818 | 210 | drama | IN |
| Kaminey | 2009 | 7.4 | 17136 | 135 | drama | IN |
| Girl, Interrupted | 1999 | 7.3 | 180532 | 127 | drama | US |
| Talaash | 2012 | 7.3 | 41752 | 149 | thriller | IN |
| Starship Troopers | 1997 | 7.3 | 288960 | 129 | scifi | US |
| The Guernsey Literary & Potato Peel Pie Society | 2018 | 7.3 | 44917 | 124 | romance | GB |
| Okja | 2017 | 7.3 | 116305 | 122 | drama | KR |
| Ishqiya | 2010 | 7.3 | 10415 | 115 | comedy | IN |
| The Edge of Seventeen | 2016 | 7.3 | 120488 | 104 | comedy | US |
| Pyaar Ka Punchnama 2 | 2015 | 7.3 | 14968 | 159 | comedy | IN |
| The Nightingale | 2018 | 7.3 | 28196 | 136 | thriller | AU |
| Corpse Bride | 2005 | 7.3 | 265023 | 77 | fantasy | US |
| Memoirs of a Geisha | 2005 | 7.3 | 146847 | 145 | drama | FR |
| El Camino: A Breaking Bad Movie | 2019 | 7.3 | 216847 | 123 | thriller | US |
| Official Secrets | 2019 | 7.3 | 45200 | 112 | thriller | GB |
| Coach Carter | 2005 | 7.3 | 143670 | 130 | drama | US |
| The King | 2019 | 7.3 | 119020 | 140 | drama | AU |
| Identity | 2003 | 7.3 | 240433 | 90 | thriller | US |
| The Best of Enemies | 2019 | 7.3 | 16441 | 133 | drama | US |
| The Professionals | 1966 | 7.3 | 16168 | 117 | western | US |
| Raat Akeli Hai | 2020 | 7.3 | 17570 | 149 | thriller | IN |
| Little Women | 1994 | 7.3 | 57621 | 115 | drama | US |
| Badhaai Do | 2022 | 7.3 | 15032 | 147 | comedy | IN |
| The Conjuring 2 | 2016 | 7.3 | 260693 | 134 | thriller | US |
| The Fundamentals of Caring | 2016 | 7.3 | 70542 | 97 | drama | US |
| The Ballad of Buster Scruggs | 2018 | 7.3 | 141528 | 132 | western | US |
| Shot Caller | 2017 | 7.3 | 83961 | 120 | thriller | US |
| The Disaster Artist | 2017 | 7.3 | 149604 | 104 | comedy | US |
| Blue Jay | 2016 | 7.3 | 17033 | 81 | romance | US |
| Te3n | 2016 | 7.3 | 12816 | 136 | thriller | IN |
| Monster | 2003 | 7.3 | 149218 | 110 | crime | US |
| No One Killed Jessica | 2011 | 7.2 | 11665 | 136 | crime | IN |
| Toilet: A Love Story | 2017 | 7.2 | 20675 | 155 | comedy | IN |
| Wish Dragon | 2021 | 7.2 | 24712 | 99 | fantasy | CN |
| Mom | 2017 | 7.2 | 10320 | 147 | crime | IN |
| The Professor and the Madman | 2019 | 7.2 | 44418 | 124 | thriller | US |
| Kabir Singh | 2019 | 7.2 | 30949 | 172 | drama | IN |
| Dolemite Is My Name | 2019 | 7.2 | 59836 | 118 | comedy | US |
| In the Line of Fire | 1993 | 7.2 | 101939 | 128 | drama | US |
| The Witcher: Nightmare of the Wolf | 2021 | 7.2 | 41508 | 83 | fantasy | PL |
| Ittefaq | 2017 | 7.2 | 12095 | 107 | thriller | IN |
| The Tinder Swindler | 2022 | 7.2 | 57606 | 114 | crime | GB |
| First They Killed My Father | 2017 | 7.2 | 17871 | 136 | drama | KH |
| Don’t Look Up | 2021 | 7.2 | 498447 | 138 | scifi | US |
| Wazir | 2016 | 7.2 | 18681 | 103 | thriller | IN |
| Gabbar Is Back | 2015 | 7.2 | 24676 | 130 | drama | IN |
| Fyre | 2019 | 7.2 | 44715 | 98 | documentary | US |
| Steve Jobs | 2015 | 7.2 | 166288 | 122 | drama | GB |
| Shooter | 2007 | 7.2 | 329417 | 124 | thriller | US |
| The Butler | 2013 | 7.2 | 114013 | 132 | drama | US |
| The Siege of Jadotville | 2016 | 7.2 | 38308 | 108 | thriller | IE |
| Closer | 2004 | 7.2 | 215678 | 94 | drama | GB |
| Private Life | 2018 | 7.2 | 19023 | 123 | drama | US |
| American Murder: The Family Next Door | 2020 | 7.2 | 26355 | 83 | crime | US |
| St. Vincent | 2014 | 7.2 | 102103 | 102 | comedy | US |
| A River Runs Through It | 1992 | 7.2 | 59086 | 123 | drama | US |
| The Patriot | 2000 | 7.2 | 270231 | 165 | drama | DE |
| Five Feet Apart | 2019 | 7.2 | 61878 | 116 | romance | US |
| Michael Clayton | 2007 | 7.2 | 163878 | 120 | thriller | US |
| The Edge of Democracy | 2019 | 7.2 | 14605 | 121 | documentary | BR |
| Paddington | 2014 | 7.2 | 111092 | 96 | comedy | FR |
| Wedding Association | 2013 | 7.1 | 22186 | 106 | comedy | XX |
| Let Me In | 2010 | 7.1 | 120208 | 116 | horror | GB |
| Raajneeti | 2010 | 7.1 | 17555 | 167 | drama | IN |
| Body of Lies | 2008 | 7.1 | 224896 | 128 | drama | GB |
| The Forgotten Battle | 2020 | 7.1 | 26368 | 124 | drama | LT |
| Knock Down the House | 2019 | 7.1 | 12418 | 86 | documentary | US |
| The Call | 2020 | 7.1 | 29450 | 112 | thriller | KR |
| Forgetting Sarah Marshall | 2008 | 7.1 | 280121 | 111 | comedy | US |
| The Railway Man | 2013 | 7.1 | 39743 | 116 | drama | GB |
| The Great Hack | 2019 | 7.1 | 22838 | 114 | documentary | US |
| Paddleton | 2019 | 7.1 | 13419 | 89 | drama | US |
| Desperado | 1995 | 7.1 | 183638 | 104 | thriller | US |
| Gantz:O | 2016 | 7.1 | 14501 | 95 | animation | JP |
| The Ring | 2002 | 7.1 | 341888 | 111 | horror | JP |
| Margin Call | 2011 | 7.1 | 125883 | 107 | thriller | US |
| Copenhagen | 2014 | 7.1 | 13135 | 98 | drama | US |
| Black Mirror: Bandersnatch | 2018 | 7.1 | 123377 | 90 | scifi | GB |
| Metallica: Through the Never | 2013 | 7.1 | 17433 | 93 | musical | US |
| Seven Years in Tibet | 1997 | 7.1 | 141308 | 136 | drama | US |
| Girl | 2018 | 7.1 | 14046 | 105 | drama | NL |
| Kung Fu Panda 3 | 2016 | 7.1 | 152791 | 95 | comedy | US |
| To All the Boys I’ve Loved Before | 2018 | 7.1 | 101175 | 100 | romance | US |
| Arthur Christmas | 2011 | 7.1 | 58296 | 97 | drama | GB |
| Trailer Park Boys: The Movie | 2006 | 7.1 | 12831 | 95 | comedy | CA |
| Karthik Calling Karthik | 2010 | 7.1 | 11944 | 135 | thriller | IN |
| The White Tiger | 2021 | 7.1 | 58190 | 125 | drama | IN |
| Shootout at Lokhandwala | 2007 | 7.1 | 10139 | 145 | crime | IN |
| The Devil All the Time | 2020 | 7.1 | 122321 | 138 | drama | US |
| Tangerine | 2015 | 7.1 | 31385 | 87 | comedy | US |
| The Danish Girl | 2015 | 7.1 | 180805 | 119 | drama | DE |
| The Unforgivable | 2021 | 7.1 | 101975 | 112 | drama | DE |
| Phir Hera Pheri | 2006 | 7.1 | 22505 | 155 | comedy | IN |
| War Dogs | 2016 | 7.1 | 208185 | 114 | crime | US |
| The Dig | 2021 | 7.1 | 71915 | 112 | drama | GB |
| Blade | 1998 | 7.1 | 267181 | 120 | action | US |
| Luck by Chance | 2009 | 7.1 | 10206 | 155 | romance | IN |
| Namastey London | 2007 | 7.1 | 21745 | 131 | comedy | IN |
| Don | 2006 | 7.1 | 36836 | 178 | thriller | IN |
| Don 2 | 2011 | 7.1 | 52338 | 139 | action | DE |
| Main Hoon Na | 2004 | 7.0 | 35142 | 179 | drama | IN |
| Gangubai Kathiawadi | 2022 | 7.0 | 44045 | 157 | drama | IN |
| Croupier | 1998 | 7.0 | 21382 | 94 | thriller | FR |
| Pieces of a Woman | 2020 | 7.0 | 47795 | 127 | drama | HU |
| Raw | 2016 | 7.0 | 72460 | 98 | horror | BE |
| Loving | 2016 | 7.0 | 34139 | 123 | drama | GB |
| Harold & Kumar Go to White Castle | 2004 | 7.0 | 193053 | 88 | comedy | US |
| Haseen Dillruba | 2021 | 7.0 | 25771 | 135 | drama | IN |
| Rose Island | 2020 | 7.0 | 20019 | 117 | drama | IT |
| Ferry | 2021 | 7.0 | 10748 | 106 | drama | NL |
| Handsome Devil | 2017 | 7.0 | 13708 | 95 | drama | IE |
| Torbaaz | 2020 | 7.0 | 17828 | 132 | drama | IN |
| Ip Man 3 | 2015 | 7.0 | 54128 | 105 | drama | HK |
| The Foreigner | 2017 | 7.0 | 112487 | 113 | thriller | IN |
| Mirai | 2018 | 7.0 | 14983 | 98 | fantasy | JP |
| Gaga: Five Foot Two | 2017 | 7.0 | 12825 | 100 | musical | US |
| The Christmas Chronicles | 2018 | 7.0 | 71182 | 104 | fantasy | US |
| Sarkar | 2018 | 7.0 | 18453 | 164 | drama | IN |
| Rambo | 2008 | 7.0 | 228799 | 92 | thriller | US |
| Dil Dhadakne Do | 2015 | 7.0 | 17149 | 170 | drama | IN |
| Soul Surfer | 2011 | 7.0 | 49101 | 112 | drama | US |
| Happy Gilmore | 1996 | 7.0 | 217534 | 92 | comedy | US |
| Den of Thieves | 2018 | 7.0 | 107701 | 140 | thriller | US |
| The Zookeeper’s Wife | 2017 | 7.0 | 42808 | 122 | drama | GB |
| The Platform | 2019 | 7.0 | 207877 | 94 | horror | ES |
| Public Enemies | 2009 | 7.0 | 297525 | 143 | crime | US |
| Ip Man 4: The Finale | 2019 | 7.0 | 30694 | 107 | drama | CN |
| John Q | 2002 | 7.0 | 131999 | 116 | drama | US |
| Fashion | 2008 | 6.9 | 12468 | 167 | romance | IN |
| Ma Rainey’s Black Bottom | 2020 | 6.9 | 50275 | 94 | musical | US |
| Any Given Sunday | 1999 | 6.9 | 118479 | 162 | drama | US |
| The Long Riders | 1980 | 6.9 | 11329 | 99 | western | US |
| Get on Up | 2014 | 6.9 | 24456 | 139 | drama | US |
| Fukrey | 2013 | 6.9 | 11656 | 137 | romance | IN |
| Chup Chup Ke | 2006 | 6.9 | 10528 | 165 | comedy | IN |
| Then Came You | 2019 | 6.9 | 12356 | 93 | comedy | US |
| The Wolf’s Call | 2019 | 6.9 | 17236 | 115 | thriller | FR |
| Cloudy with a Chance of Meatballs | 2009 | 6.9 | 226225 | 90 | animation | US |
| Our Souls at Night | 2017 | 6.9 | 13360 | 101 | drama | US |
| Chandigarh Kare Aashiqui | 2021 | 6.9 | 12303 | 117 | crime | IN |
| Welcome | 2007 | 6.9 | 21799 | 160 | romance | IN |
| Everybody Knows | 2018 | 6.9 | 34009 | 132 | drama | IT |
| The Meyerowitz Stories (New and Selected) | 2017 | 6.9 | 47971 | 112 | drama | US |
| I Don’t Feel at Home in This World Anymore | 2017 | 6.9 | 55549 | 93 | drama | US |
| The Highwaymen | 2019 | 6.9 | 88714 | 132 | thriller | US |
| Legend | 2015 | 6.9 | 175273 | 132 | thriller | US |
| AK vs AK | 2020 | 6.9 | 14048 | 108 | drama | IN |
| My Girl | 1991 | 6.9 | 79800 | 103 | comedy | US |
| Amanda Knox | 2016 | 6.9 | 23969 | 92 | crime | DK |
| 2 States | 2014 | 6.9 | 25344 | 149 | comedy | IN |
| The Power of the Dog | 2021 | 6.9 | 158487 | 126 | drama | CA |
| The Half of It | 2020 | 6.9 | 34959 | 104 | comedy | US |
| Outlaw King | 2018 | 6.9 | 69834 | 121 | drama | GB |
| Mary Kom | 2014 | 6.9 | 10656 | 122 | drama | IN |
| Suffragette | 2015 | 6.9 | 41529 | 106 | drama | FR |
| Legend of the Guardians: The Owls of Ga’Hoole | 2010 | 6.9 | 82623 | 100 | fantasy | US |
| Christine | 2016 | 6.9 | 14977 | 115 | drama | US |
| The Night Comes for Us | 2018 | 6.9 | 25500 | 121 | thriller | ID |
| The Trip | 2021 | 6.9 | 19706 | 113 | comedy | NO |
| The Dirt | 2019 | 6.9 | 47603 | 108 | drama | US |
| Top Gun | 1986 | 6.9 | 329656 | 110 | drama | US |
| Radhe Shyam | 2022 | 6.9 | 21328 | 138 | romance | IN |
| Sorry to Bother You | 2018 | 6.9 | 75653 | 111 | fantasy | US |
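Since the outcome of interest is `MAIN_GENRE`, one quick check is how many titles fall into each genre. The following is a minimal sketch, not part of the original analysis: `movies_sample` and `genre_counts` are hypothetical names, and the five rows are copied from the movie table above instead of being read from the full CSV.

```r
library(dplyr)

# A handful of rows copied from the movie table above (not the full dataset).
movies_sample <- tibble::tribble(
  ~TITLE,                ~SCORE, ~DURATION, ~MAIN_GENRE,
  "Inception",           8.8,    148,       "scifi",
  "Forrest Gump",        8.8,    142,       "drama",
  "Anbe Sivam",          8.7,    160,       "comedy",
  "Bo Burnham: Inside",  8.7,    87,        "comedy",
  "Saving Private Ryan", 8.6,    169,       "drama"
)

# Tally titles per genre, most frequent first.
genre_counts <- movies_sample %>%
  count(MAIN_GENRE, sort = TRUE, name = "n_titles")

genre_counts
```

Running the same `count()` call on `movies1` would give the tally for the full dataset, which shows how unevenly the genres are represented.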
shows1[,-1] %>% kable() %>%
kable_styling("striped", full_width = FALSE) %>%
scroll_box(height = "420px")
| TITLE | RELEASE_YEAR | SCORE | NUMBER_OF_VOTES | DURATION | NUMBER_OF_SEASONS | MAIN_GENRE | MAIN_PRODUCTION |
|---|---|---|---|---|---|---|---|
| Breaking Bad | 2008 | 9.5 | 1727694 | 48 | 5 | drama | US |
| Avatar: The Last Airbender | 2005 | 9.3 | 297336 | 24 | 3 | scifi | US |
| Our Planet | 2019 | 9.3 | 41386 | 50 | 1 | documentary | GB |
| Kota Factory | 2019 | 9.3 | 66985 | 42 | 2 | drama | IN |
| The Last Dance | 2020 | 9.1 | 108321 | 50 | 1 | documentary | US |
| Arcane | 2021 | 9.1 | 175412 | 41 | 1 | action | US |
| Attack on Titan | 2013 | 9.0 | 325381 | 24 | 4 | scifi | JP |
| Hunter x Hunter | 2011 | 9.0 | 87857 | 23 | 3 | drama | JP |
| DEATH NOTE | 2006 | 9.0 | 302147 | 24 | 1 | scifi | JP |
| Seinfeld | 1989 | 8.9 | 302700 | 24 | 9 | comedy | US |
| Cowboy Bebop | 1998 | 8.9 | 112887 | 25 | 1 | western | JP |
| Heartstopper | 2022 | 8.9 | 28978 | 28 | 1 | drama | GB |
| When They See Us | 2019 | 8.9 | 114127 | 74 | 1 | drama | US |
| Monty Python’s Flying Circus | 1969 | 8.8 | 72895 | 30 | 4 | comedy | GB |
| BoJack Horseman | 2014 | 8.8 | 143584 | 26 | 6 | drama | US |
| Chappelle’s Show | 2003 | 8.8 | 62140 | 21 | 3 | comedy | US |
| Better Call Saul | 2015 | 8.8 | 404920 | 49 | 6 | comedy | US |
| Narcos | 2015 | 8.8 | 404486 | 52 | 3 | drama | US |
| One Piece | 1999 | 8.8 | 112586 | 23 | 21 | action | JP |
| Peaky Blinders | 2013 | 8.8 | 485506 | 58 | 6 | drama | GB |
| Anne with an E | 2017 | 8.7 | 51001 | 46 | 3 | drama | CA |
| Dark | 2017 | 8.7 | 354443 | 56 | 3 | scifi | DE |
| House of Cards | 2013 | 8.7 | 494092 | 52 | 6 | drama | US |
| Demon Slayer: Kimetsu no Yaiba | 2019 | 8.7 | 88265 | 25 | 3 | animation | JP |
| Stranger Things | 2016 | 8.7 | 989090 | 52 | 5 | scifi | US |
| One-Punch Man | 2015 | 8.7 | 148386 | 24 | 2 | action | JP |
| The Crown | 2016 | 8.7 | 190878 | 56 | 5 | drama | US |
| Arrested Development | 2003 | 8.7 | 297552 | 28 | 5 | comedy | US |
| Friday Night Lights | 2006 | 8.7 | 64449 | 43 | 5 | drama | US |
| Downton Abbey | 2010 | 8.7 | 197744 | 58 | 6 | drama | GB |
| Code Geass: Lelouch of the Rebellion | 2006 | 8.7 | 62367 | 24 | 3 | scifi | JP |
| Trailer Park Boys | 2001 | 8.6 | 41791 | 25 | 12 | comedy | CA |
| Mindhunter | 2017 | 8.6 | 261429 | 53 | 2 | crime | US |
| The Haunting of Hill House | 2018 | 8.6 | 226817 | 58 | 1 | drama | US |
| The Queen’s Gambit | 2020 | 8.6 | 406350 | 56 | 1 | drama | US |
| Cobra Kai | 2018 | 8.6 | 163858 | 31 | 5 | action | US |
| Wentworth | 2013 | 8.6 | 21747 | 46 | 9 | drama | AU |
| It’s Okay to Not Be Okay | 2020 | 8.6 | 21104 | 76 | 1 | drama | KR |
| Making a Murderer | 2015 | 8.6 | 93456 | 62 | 2 | crime | US |
| Shameless | 2011 | 8.6 | 230243 | 54 | 11 | drama | US |
| Sacred Games | 2018 | 8.6 | 85088 | 50 | 2 | action | IN |
| Formula 1: Drive to Survive | 2019 | 8.6 | 36661 | 38 | 6 | documentary | GB |
| Queer Eye | 2018 | 8.5 | 18147 | 47 | 6 | reality | US |
| Community | 2009 | 8.5 | 252564 | 23 | 6 | comedy | US |
| The Last Kingdom | 2015 | 8.5 | 126473 | 55 | 5 | action | GB |
| Schitt’s Creek | 2015 | 8.5 | 112537 | 22 | 6 | comedy | CA |
| Neon Genesis Evangelion | 1995 | 8.5 | 64727 | 24 | 1 | scifi | JP |
| Call the Midwife | 2012 | 8.5 | 25562 | 56 | 11 | drama | GB |
| The IT Crowd | 2006 | 8.5 | 147409 | 25 | 5 | comedy | GB |
| ERASED | 2016 | 8.5 | 42699 | 22 | 1 | drama | JP |
| Supernatural | 2005 | 8.5 | 428639 | 45 | 15 | scifi | US |
| Ozark | 2017 | 8.5 | 278223 | 60 | 4 | crime | US |
| Delhi Crime | 2019 | 8.5 | 18732 | 27 | 1 | drama | IN |
| After Life | 2019 | 8.5 | 124972 | 28 | 3 | comedy | GB |
| Hilda | 2018 | 8.5 | 10162 | 25 | 2 | scifi | GB |
| Borgen | 2010 | 8.5 | 23523 | 58 | 4 | drama | DK |
| Ash vs Evil Dead | 2015 | 8.4 | 70087 | 30 | 3 | action | US |
| Heartland | 2007 | 8.4 | 15743 | 44 | 15 | drama | CA |
| Unbelievable | 2019 | 8.4 | 95658 | 48 | 1 | drama | US |
| The Promised Neverland | 2019 | 8.4 | 34730 | 23 | 2 | scifi | JP |
| Derry Girls | 2018 | 8.4 | 28718 | 25 | 3 | comedy | GB |
| Innocent | 2017 | 8.4 | 17727 | 51 | 1 | drama | TR |
| Trollhunters: Tales of Arcadia | 2016 | 8.4 | 16509 | 22 | 3 | action | US |
| Outlander | 2014 | 8.4 | 152435 | 60 | 6 | scifi | US |
| The Legend of Korra | 2012 | 8.4 | 117464 | 23 | 4 | action | US |
| The Dark Crystal: Age of Resistance | 2019 | 8.4 | 24164 | 51 | 1 | scifi | GB |
| Vincenzo | 2021 | 8.4 | 15134 | 81 | 1 | action | KR |
| Top Boy | 2011 | 8.4 | 22420 | 48 | 2 | drama | GB |
| Babylon Berlin | 2017 | 8.4 | 23256 | 49 | 4 | crime | DE |
| Stargate SG-1 | 1997 | 8.4 | 90196 | 44 | 10 | scifi | US |
| Violet Evergarden | 2018 | 8.4 | 19940 | 25 | 1 | drama | JP |
| Narcos: Mexico | 2018 | 8.4 | 80902 | 56 | 3 | drama | US |
| The Dragon Prince | 2018 | 8.4 | 21635 | 27 | 3 | scifi | US |
| Maid | 2021 | 8.4 | 74955 | 54 | 1 | drama | US |
| Car Masters: Rust to Riches | 2018 | 8.4 | 10024 | 39 | 3 | reality | US |
| Naruto | 2002 | 8.4 | 93980 | 23 | 6 | scifi | JP |
| The Originals | 2013 | 8.3 | 131574 | 43 | 5 | scifi | US |
| Fauda | 2015 | 8.3 | 25239 | 40 | 3 | war | IL |
| Sex Education | 2019 | 8.3 | 251168 | 52 | 4 | drama | GB |
| Kingdom | 2019 | 8.3 | 43760 | 48 | 2 | action | KR |
| Money Heist | 2017 | 8.3 | 450797 | 50 | 5 | crime | ES |
| Young Royals | 2021 | 8.3 | 26732 | 45 | 1 | drama | SE |
| Atypical | 2017 | 8.3 | 81643 | 31 | 4 | comedy | US |
| Gurren Lagann | 2007 | 8.3 | 17024 | 24 | 1 | scifi | JP |
| 30 Rock | 2006 | 8.3 | 121514 | 25 | 7 | comedy | US |
| Castlevania | 2017 | 8.3 | 61114 | 26 | 4 | scifi | US |
| Kim’s Convenience | 2016 | 8.3 | 16970 | 22 | 5 | comedy | CA |
| Master of None | 2015 | 8.3 | 72341 | 32 | 3 | drama | US |
| Longmire | 2012 | 8.3 | 34362 | 53 | 6 | western | US |
| Call My Agent! | 2015 | 8.3 | 13331 | 55 | 4 | comedy | FR |
| The Get Down | 2016 | 8.2 | 22304 | 62 | 2 | drama | US |
| Sense8 | 2015 | 8.2 | 151518 | 62 | 2 | scifi | US |
| One Day at a Time | 2017 | 8.2 | 15669 | 29 | 4 | comedy | US |
| The Kominsky Method | 2018 | 8.2 | 38232 | 26 | 3 | drama | US |
| Itaewon Class | 2020 | 8.2 | 12030 | 69 | 1 | drama | KR |
| The Good Place | 2016 | 8.2 | 148562 | 23 | 4 | scifi | US |
| Grace and Frankie | 2015 | 8.2 | 48435 | 30 | 7 | comedy | US |
| The Witcher | 2019 | 8.2 | 465949 | 58 | 2 | scifi | US |
| Locked Up | 2015 | 8.2 | 21388 | 50 | 4 | drama | ES |
| Gilmore Girls | 2000 | 8.2 | 119054 | 46 | 8 | comedy | US |
| Caliphate | 2020 | 8.2 | 17735 | 48 | 1 | war | SE |
| Fate/Zero | 2011 | 8.2 | 12568 | 25 | 2 | scifi | JP |
| The Walking Dead | 2010 | 8.2 | 945125 | 46 | 11 | action | US |
| anohana: The Flower We Saw That Day | 2011 | 8.2 | 12682 | 23 | 1 | drama | JP |
| The Trials of Gabriel Fernandez | 2020 | 8.1 | 10982 | 55 | 1 | crime | US |
| How to Get Away with Murder | 2014 | 8.1 | 146712 | 43 | 6 | drama | US |
| Derek | 2013 | 8.1 | 31976 | 26 | 3 | drama | GB |
| Rascal Does Not Dream of Bunny Girl Senpai | 2018 | 8.1 | 10400 | 25 | 1 | animation | JP |
| Wild Wild Country | 2018 | 8.1 | 29298 | 67 | 1 | crime | US |
| Bodyguard | 2018 | 8.1 | 114446 | 60 | 2 | war | GB |
| American Vandal | 2017 | 8.1 | 29972 | 33 | 2 | comedy | US |
| The End of the F***ing World | 2017 | 8.1 | 177868 | 21 | 2 | crime | GB |
| Manhunt | 2017 | 8.1 | 57459 | 43 | 2 | documentary | US |
| Criminal Minds | 2005 | 8.1 | 189191 | 44 | 15 | thriller | US |
| HAPPY! | 2017 | 8.1 | 37747 | 43 | 2 | scifi | US |
| Orange Is the New Black | 2013 | 8.1 | 295591 | 59 | 7 | drama | US |
| Star Trek: Deep Space Nine | 1993 | 8.1 | 61145 | 47 | 7 | scifi | US |
| Lovesick | 2014 | 8.1 | 19259 | 24 | 3 | comedy | GB |
| Lucifer | 2016 | 8.1 | 308291 | 47 | 6 | scifi | US |
| F is for Family | 2015 | 8.0 | 36050 | 28 | 5 | comedy | US |
| Don’t F**k with Cats: Hunting an Internet Killer | 2019 | 8.0 | 50250 | 62 | 1 | crime | US |
| GLOW | 2017 | 8.0 | 44751 | 34 | 4 | drama | US |
| The Umbrella Academy | 2019 | 8.0 | 202522 | 52 | 3 | comedy | US |
| Tabula Rasa | 2017 | 8.0 | 10161 | 52 | 1 | drama | BE |
| Dead to Me | 2019 | 8.0 | 73110 | 31 | 2 | drama | US |
| The Mechanism | 2018 | 8.0 | 36077 | 47 | 2 | drama | BR |
| Squid Game | 2021 | 8.0 | 416738 | 54 | 2 | action | KR |
| Toradora! | 2008 | 8.0 | 14307 | 30 | 2 | animation | JP |
| Comedians in Cars Getting Coffee | 2012 | 8.0 | 12363 | 20 | 11 | comedy | US |
| Marco Polo | 2014 | 8.0 | 71229 | 55 | 2 | action | US |
| Travelers | 2016 | 8.0 | 54566 | 45 | 3 | scifi | CA |
| The Blacklist | 2013 | 8.0 | 238138 | 43 | 9 | drama | US |
| Unorthodox | 2020 | 8.0 | 74118 | 54 | 1 | drama | DE |
| Inside Bill’s Brain: Decoding Bill Gates | 2019 | 7.9 | 10776 | 50 | 1 | documentary | US |
| On My Block | 2018 | 7.9 | 15642 | 29 | 4 | comedy | US |
| Medici: Masters of Florence | 2016 | 7.9 | 18575 | 54 | 3 | war | GB |
| The Seven Deadly Sins | 2014 | 7.9 | 30341 | 24 | 5 | action | JP |
| Queen of the South | 2016 | 7.9 | 28537 | 42 | 5 | drama | US |
| KILL la KILL | 2013 | 7.9 | 13372 | 25 | 1 | action | JP |
| Bloodline | 2015 | 7.9 | 50028 | 57 | 3 | drama | US |
| Into the Badlands | 2015 | 7.9 | 45628 | 43 | 3 | action | US |
| Merlin | 2008 | 7.9 | 80138 | 44 | 5 | action | GB |
| InuYasha | 2000 | 7.9 | 15823 | 25 | 9 | action | JP |
| Altered Carbon | 2018 | 7.9 | 162018 | 52 | 2 | scifi | US |
| The OA | 2016 | 7.9 | 100911 | 55 | 2 | scifi | US |
| Resurrection: Ertugrul | 2014 | 7.9 | 35515 | 57 | 5 | war | TR |
| The Borgias | 2011 | 7.9 | 50556 | 51 | 3 | drama | CA |
| Lilyhammer | 2012 | 7.9 | 29196 | 45 | 3 | crime | NO |
| Boys Over Flowers | 2009 | 7.9 | 11579 | 64 | 1 | comedy | KR |
| I Think You Should Leave with Tim Robinson | 2019 | 7.9 | 11411 | 16 | 3 | comedy | US |
| Suburra: Blood on Rome | 2017 | 7.9 | 14346 | 48 | 3 | crime | IT |
| The Sinner | 2017 | 7.9 | 117055 | 45 | 4 | crime | US |
| The Spy | 2019 | 7.9 | 40535 | 53 | 1 | drama | FR |
| Santa Clarita Diet | 2017 | 7.9 | 64467 | 29 | 3 | comedy | US |
| How to Sell Drugs Online (Fast) | 2019 | 7.9 | 30734 | 32 | 3 | drama | DE |
| Rise of Empires: Ottoman | 2020 | 7.9 | 22045 | 45 | 2 | documentary | TR |
| Big Mouth | 2017 | 7.9 | 74660 | 27 | 6 | animation | US |
| Versailles | 2015 | 7.9 | 16273 | 54 | 3 | drama | CA |
| She-Ra and the Princesses of Power | 2018 | 7.8 | 14935 | 24 | 5 | scifi | US |
| A Series of Unfortunate Events | 2017 | 7.8 | 59239 | 47 | 3 | action | US |
| Gotham | 2014 | 7.8 | 226081 | 43 | 5 | scifi | US |
| Workin’ Moms | 2017 | 7.8 | 14892 | 22 | 6 | comedy | CA |
| Russian Doll | 2019 | 7.8 | 88945 | 28 | 2 | drama | US |
| Sweet Tooth | 2021 | 7.8 | 49182 | 45 | 2 | scifi | US |
| Imposters | 2017 | 7.8 | 13238 | 43 | 2 | drama | US |
| iZombie | 2015 | 7.8 | 66488 | 42 | 5 | scifi | US |
| Conversations with a Killer: The Ted Bundy Tapes | 2019 | 7.8 | 27439 | 59 | 1 | crime | US |
| Never Have I Ever | 2020 | 7.8 | 45346 | 28 | 4 | drama | US |
| My Name | 2021 | 7.8 | 18446 | 50 | 1 | thriller | KR |
| Happy Endings | 2011 | 7.8 | 37778 | 21 | 3 | comedy | US |
| Crazy Ex-Girlfriend | 2015 | 7.8 | 19738 | 42 | 4 | comedy | US |
| El Chapo | 2017 | 7.8 | 17992 | 44 | 3 | drama | US |
| NCIS | 2003 | 7.8 | 141049 | 44 | 19 | action | US |
| Jane the Virgin | 2014 | 7.8 | 46286 | 42 | 5 | drama | US |
| Undercover | 2019 | 7.8 | 16773 | 49 | 3 | drama | BE |
| DOTA: Dragon’s Blood | 2021 | 7.8 | 17429 | 26 | 2 | scifi | US |
| The Innocent | 2021 | 7.8 | 29019 | 58 | 1 | crime | ES |
| The Staircase | 2004 | 7.8 | 21531 | 49 | 2 | crime | FR |
| Giri/Haji | 2019 | 7.8 | 13570 | 58 | 1 | thriller | GB |
| Angel Beats! | 2010 | 7.7 | 13848 | 26 | 1 | scifi | JP |
| Love | 2016 | 7.7 | 41362 | 32 | 3 | drama | US |
| In the Dark | 2019 | 7.7 | 10927 | 41 | 4 | comedy | US |
| Midnight Mass | 2021 | 7.7 | 102321 | 64 | 1 | action | US |
| The Vampire Diaries | 2009 | 7.7 | 310776 | 42 | 8 | scifi | US |
| Alias Grace | 2017 | 7.7 | 31577 | 44 | 1 | drama | CA |
| You | 2018 | 7.7 | 225949 | 47 | 3 | thriller | US |
| Norsemen | 2016 | 7.7 | 18040 | 30 | 3 | comedy | NO |
| Good Girls | 2018 | 7.7 | 49867 | 42 | 4 | comedy | US |
| Seven Seconds | 2018 | 7.7 | 15323 | 62 | 1 | crime | US |
| The Chestnut Man | 2021 | 7.7 | 41253 | 55 | 1 | crime | DK |
| Crashing | 2016 | 7.7 | 18546 | 30 | 1 | comedy | GB |
| Maniac | 2018 | 7.7 | 74877 | 39 | 1 | drama | US |
| New Girl | 2011 | 7.7 | 216209 | 21 | 7 | comedy | US |
| Miraculous: Tales of Ladybug & Cat Noir | 2015 | 7.7 | 10102 | 22 | 4 | romance | FR |
| Home for Christmas | 2019 | 7.7 | 13670 | 29 | 2 | comedy | NO |
| Inside Job | 2021 | 7.6 | 15137 | 28 | 1 | comedy | US |
| The Serpent | 2021 | 7.6 | 41782 | 57 | 1 | drama | GB |
| Alice in Borderland | 2020 | 7.6 | 47651 | 47 | 2 | action | JP |
| Spinning Out | 2020 | 7.6 | 13692 | 50 | 1 | drama | US |
| Messiah | 2020 | 7.6 | 42018 | 45 | 1 | drama | US |
| Teenage Bounty Hunters | 2020 | 7.6 | 10821 | 49 | 1 | action | US |
| Shadow and Bone | 2021 | 7.6 | 77782 | 52 | 1 | scifi | US |
| The Devil Next Door | 2019 | 7.6 | 13084 | 46 | 1 | documentary | US |
| My Little Pony: Friendship Is Magic | 2010 | 7.6 | 20708 | 22 | 9 | scifi | CA |
| The Magicians | 2015 | 7.6 | 49557 | 44 | 5 | drama | US |
| Bordertown | 2016 | 7.6 | 10371 | 59 | 3 | drama | FI |
| Unbreakable Kimmy Schmidt | 2015 | 7.6 | 70242 | 30 | 4 | comedy | US |
| Chicago Med | 2015 | 7.6 | 24142 | 41 | 8 | drama | US |
| The 100 | 2014 | 7.6 | 242221 | 42 | 7 | drama | US |
| The Flash | 2014 | 7.6 | 336888 | 42 | 8 | scifi | US |
| Cable Girls | 2017 | 7.6 | 13855 | 50 | 5 | drama | ES |
| Criminal: UK | 2019 | 7.6 | 17992 | 44 | 2 | drama | GB |
| Devilman Crybaby | 2018 | 7.6 | 18575 | 25 | 1 | scifi | JP |
| Madam Secretary | 2014 | 7.6 | 22563 | 43 | 6 | war | US |
| Sword Art Online | 2012 | 7.6 | 43727 | 23 | 4 | scifi | JP |
| Gilmore Girls: A Year in the Life | 2016 | 7.6 | 36359 | 92 | 1 | comedy | US |
| Dead Set | 2008 | 7.6 | 19684 | 141 | 1 | scifi | GB |
| Grey’s Anatomy | 2005 | 7.6 | 293618 | 49 | 18 | drama | US |
| Goosebumps | 1995 | 7.6 | 13361 | 22 | 4 | scifi | CA |
| Love 101 | 2020 | 7.5 | 13797 | 46 | 2 | comedy | TR |
| Shooter | 2016 | 7.5 | 35547 | 41 | 3 | war | US |
| Reign | 2013 | 7.5 | 47751 | 42 | 4 | drama | US |
| Blue Exorcist | 2011 | 7.5 | 12741 | 24 | 2 | scifi | JP |
| Outer Banks | 2020 | 7.5 | 43404 | 49 | 2 | action | US |
| Arrow | 2012 | 7.5 | 425716 | 42 | 8 | action | US |
| Halston | 2021 | 7.5 | 14040 | 47 | 1 | drama | US |
| The Night Shift | 2014 | 7.5 | 12069 | 42 | 4 | drama | US |
| The Heirs | 2013 | 7.5 | 10329 | 59 | 1 | drama | KR |
| The Politician | 2019 | 7.5 | 21887 | 43 | 2 | comedy | US |
| Ragnarok | 2020 | 7.5 | 36185 | 47 | 2 | action | NO |
| Everything Sucks! | 2018 | 7.5 | 18023 | 24 | 1 | drama | US |
| Frequency | 2016 | 7.5 | 13625 | 18 | 3 | scifi | US |
| Dark Matter | 2015 | 7.5 | 41867 | 43 | 3 | scifi | CA |
| Night Stalker: The Hunt for a Serial Killer | 2021 | 7.5 | 23939 | 47 | 1 | crime | US |
| Dash & Lily | 2020 | 7.5 | 16978 | 25 | 1 | comedy | US |
| True Story | 2021 | 7.5 | 16927 | 39 | 1 | drama | US |
| Hollywood | 2020 | 7.5 | 35067 | 50 | 1 | drama | US |
| Designated Survivor | 2016 | 7.5 | 88019 | 44 | 3 | war | US |
| Quicksand | 2019 | 7.5 | 21077 | 46 | 1 | drama | SE |
| Feel Good | 2020 | 7.5 | 10317 | 25 | 2 | drama | GB |
| Dogs of Berlin | 2018 | 7.5 | 12453 | 60 | 1 | drama | DE |
| Evil Genius | 2018 | 7.5 | 27516 | 48 | 1 | crime | US |
| 13 Reasons Why | 2017 | 7.5 | 282373 | 58 | 4 | drama | US |
| Lupin | 2021 | 7.5 | 100575 | 46 | 3 | crime | FR |
| All of Us Are Dead | 2022 | 7.5 | 41393 | 61 | 1 | action | KR |
| I Am Not Okay with This | 2020 | 7.5 | 56459 | 21 | 1 | comedy | US |
nrow(movies1); nrow(shows1)
## [1] 387
## [1] 246
Categories for Movies:
Title
Release Year
IMDb Score
Number of Votes
Duration (in minutes)
Main Genre
Main Production (Country Code)
Categories for Shows:
Title
Release Year
IMDb Score
Number of Votes
Duration (in minutes)
Number of Seasons
Main Genre
Main Production (Country Code)
The two datasets contain the same variables, except that shows have one additional variable, the number of seasons. There are 387 movie observations and 246 show observations.
We now have a general idea of what we are working with. Next, we look at each variable to check whether any transformations are needed and whether any data is missing.
vis_miss(movies1); vis_miss(shows1)
The missing-values maps show no missing data, so we do not have to remove any observations here.
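The visual check from vis_miss() can be backed up with a numeric count of missing values per column. The sketch below uses a small made-up data frame (`toy` is hypothetical and deliberately contains one missing value, unlike the project data) to show the pattern; with the real data the same calls would presumably be applied to movies1 and shows1.

```r
# Toy data frame standing in for movies1/shows1 (values are illustrative only)
toy <- data.frame(
  TITLE = c("A", "B", "C"),
  SCORE = c(8.1, NA, 7.9),
  DURATION = c(100, 95, 120)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value
no_missing <- all(na_counts == 0)
print(no_missing)
```

A vector of all zeros from colSums(is.na(...)) would agree with an empty missing-values map.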
m1 <- movies1 %>%
group_by(MAIN_GENRE) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) +
geom_bar(stat = "identity", fill = "#ff9896") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Main Genre")
s1 <- shows1 %>%
group_by(MAIN_GENRE) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) +
geom_bar(stat = "identity", fill = "#c49c94") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Main Genre")
m1 + s1
m2 <- movies1 %>%
group_by(MAIN_PRODUCTION) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(MAIN_PRODUCTION, count))) +
geom_bar(stat = "identity", fill = "#ff9896") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Main Production")
s2 <- shows1 %>%
group_by(MAIN_PRODUCTION) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(MAIN_PRODUCTION, count))) +
geom_bar(stat = "identity", fill = "#c49c94") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Main Production")
m2 + s2
As we can see, there are 35 different country codes for movies and 19 for shows. Since this is a categorical variable, it would create many dummy variables in our recipe. In addition, many countries have only one observation, so the raw country code would not make a good predictor for our models. We will group the main production country into regions (Asia/Oceania, North/South America, Europe, and Africa/Middle East) to make the data easier to work with. Similarly, main genre has many categories: 15 for movies and 12 for shows. Grouping genres into broader categories is not as straightforward, since each genre is already distinct. Instead, we will drop any genre with fewer than five observations.
m3 <- movies1 %>%
ggplot(aes(x=RELEASE_YEAR)) +
geom_histogram(aes(y=after_stat(density)), fill = "black") + # after_stat() replaces the deprecated ..density.. notation
geom_density(alpha=0.7, fill="#ff9896") +
theme_hc() +
labs(x = "Release Year", y = "Density")
s3 <- shows1 %>%
ggplot(aes(x=RELEASE_YEAR)) +
geom_histogram(aes(y=after_stat(density)), fill = "black") + # after_stat() replaces the deprecated ..density.. notation
geom_density(alpha=0.7, fill="#c49c94") +
theme_hc() +
labs(x = "Release Year", y = "Density")
m3 + s3
Looking at the histograms of release year, we see that the data is heavily skewed left. There are a few observations with release years before 2000, but because there are so few, we will only look at movies and shows released in or after the year 2000. We will also change release year from a numeric variable to an ordered categorical variable.
movies2 <- subset(movies1, MAIN_PRODUCTION!="XX" & RELEASE_YEAR >= 2000) # XX is not a country
movies_genre_counts <- table(movies2$MAIN_GENRE)
movies_selected_genres <- movies_genre_counts[movies_genre_counts >= 5]
movies2 <- subset(movies2, MAIN_GENRE %in% names(movies_selected_genres)) # only keeping main genre levels with at least 5 obs
movies <- movies2 %>%
mutate(REGION = forcats::fct_collapse(MAIN_PRODUCTION,
AsiaOceania = c("CN", "HK", "ID", "IN", "JP",
"KH", "KR", "TH", "AU", "NZ"),
AfricaME = c("CD", "MW", "ZA", "PS", "TR"),
NSAmerica = c("CA", "US", "AR", "BR", "MX"),
Europe = c("BE", "DE", "DK", "ES", "FR",
"GB", "HU", "IE", "IT", "LT",
"NL", "NO", "PL", "UA"))) %>%
select(-MAIN_PRODUCTION)
movies$RELEASE_YEAR <- factor(movies$RELEASE_YEAR, ordered=TRUE)
shows2 <- subset(shows1, RELEASE_YEAR >= 2000)
shows_genre_counts <- table(shows2$MAIN_GENRE)
shows_selected_genres <- shows_genre_counts[shows_genre_counts >= 5]
shows2 <- subset(shows2, MAIN_GENRE %in% names(shows_selected_genres))
shows <- shows2 %>%
mutate(REGION = forcats::fct_collapse(MAIN_PRODUCTION,
AsiaOceania = c("IN", "JP", "KR", "AU"),
AfricaME = c("TR", "IL"),
NSAmerica = c("CA", "US", "BR"),
Europe = c("BE", "DE", "DK", "ES", "FI",
"FR", "GB", "IT", "NO", "SE"))) %>%
select(-MAIN_PRODUCTION)
shows$RELEASE_YEAR <- factor(shows$RELEASE_YEAR, ordered=TRUE)
Let’s take a quick look at our new data:
m4 <- movies %>%
group_by(MAIN_GENRE) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) +
geom_bar(stat = "identity", fill = "#ff9896") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Main Genre")
s4 <- shows %>%
group_by(MAIN_GENRE) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(MAIN_GENRE, count))) +
geom_bar(stat = "identity", fill = "#c49c94") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Main Genre")
m4 + s4
m5 <- movies %>%
group_by(REGION) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(REGION, count))) +
geom_bar(stat = "identity", fill = "#ff9896") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Region")
s5 <- shows %>%
group_by(REGION) %>%
summarize(count = n()) %>%
ggplot(aes(x=count, y=reorder(REGION, count))) +
geom_bar(stat = "identity", fill = "#c49c94") +
geom_text(aes(label=count), vjust=0.5, hjust=-0.25, size=3) +
theme_hc() +
labs(x = "Count", y = "Region")
m5 + s5
m6 <- movies %>%
ggplot(aes(x=RELEASE_YEAR)) +
geom_bar(fill = "#ff9896") +
theme_hc() +
labs(x = "Release Year", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
s6 <- shows %>%
ggplot(aes(x=RELEASE_YEAR)) +
geom_bar(fill = "#c49c94") +
theme_hc() +
labs(x = "Release Year", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))
m6 + s6
Before getting into our model building, we want to compare movies and shows. Let’s look at the distributions of scores, main genres, regions, and number of votes.
movies$TYPE <- rep("movie", nrow(movies))
shows$TYPE <- rep("show", nrow(shows))
netflix_combined <- dplyr::bind_rows(movies[c(3:9)], shows[c(3:6,8:10)])
netflix_combined %>%
ggplot(aes(x=TYPE, y=SCORE, fill=TYPE)) +
geom_boxplot() +
theme_hc() +
scale_fill_manual(values = c("#ff9896", "#c49c94")) +
labs(x = "Type", y = "Score", title = "Box Plot of Score Distribution", fill = "Type")
netflix_combined %>%
ggplot(aes(x=TYPE, fill=MAIN_GENRE)) +
geom_bar() +
theme_hc() +
labs(x = "Type", y = "Count", title = "Stacked Bar Chart of Main Genres", fill = "Main Genre")
netflix_combined %>%
ggplot(aes(x=TYPE, fill=REGION)) +
geom_bar() +
theme_hc() +
labs(x = "Type", y = "Count", title = "Stacked Bar Chart of Regions", fill = "Region")
netflix_combined %>%
ggplot(aes(x=NUMBER_OF_VOTES, fill=TYPE)) +
geom_density(alpha = 0.7) +
scale_fill_manual(values = c("#ff9896", "#c49c94")) +
theme_hc() +
labs(x = "Number of Votes", y = "Density", title = "Density Plot of Number of Votes", fill = "Type")
Here are some things we observe:
SCORE: Movie scores range from 6.9 to 9.0 and show scores range from 7.5 to 9.5. Show scores have a higher median than movie scores, but their ranges are about the same.
MAIN GENRE: For both movies and shows, drama is the genre with the most observations, making up about 1/3 of each dataset. It is followed by thriller and then comedy for movies, and by comedy and then scifi for shows. The classes are heavily imbalanced, which is something we’ll have to keep in mind when we’re building our models.
REGION: For movies, there is a decent proportion of observations from North/South America, Asia/Oceania, and Europe/Middle East, with only two observations in Africa. For shows, over half of the observations are from North/South America.
NUMBER OF VOTES: The distribution of the number of votes is heavily skewed right. This aligns with our understanding that most movies and shows will have fewer votes, and only a few really popular ones have a higher number of votes.
Now that we have explored individual variables, we want to see whether there are any relationships between them.
movies %>%
dplyr::select(SCORE, NUMBER_OF_VOTES, DURATION) %>%
cor() %>%
corrplot(type="lower", method="color", diag=FALSE, addCoef.col = "black", number.cex = 1)
shows %>%
dplyr::select(SCORE, NUMBER_OF_VOTES, DURATION, NUMBER_OF_SEASONS) %>%
cor() %>%
corrplot(type="lower", method="color", diag=FALSE, addCoef.col = "black", number.cex = 1)
While our dataset does not contain many numeric variables, it is still interesting to look at the correlation plot of what we have. The strongest correlation is between the number of votes and score. This may be because more popular movies and shows tend to receive higher scores. The number of seasons of a show and the number of votes also have a moderately strong correlation. This isn’t surprising either, as a show with more seasons is often popular and long-running, and so accumulates more votes.
Since main genre is our variable of interest, we want to see how each genre relates to our predictor variables. In this plot we see the distribution of scores for each genre of movies and shows. For movies, there is a wide range of score values for each genre. Scifi, documentary, and comedy have the highest medians as well as the widest ranges, while horror has the lowest. For shows, the score ranges differ less drastically between genres than they do for movies. Drama, documentary, scifi, and action have the largest ranges, while war has the smallest.
m8 <- movies %>%
ggplot(aes(x=SCORE, y=NUMBER_OF_VOTES, color=MAIN_GENRE)) +
geom_point() +
theme_hc() +
labs(x = "Score", y = "Number of Votes", title="Movies", color = "Main Genre") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
legend.key.size = unit(1, 'cm'),
legend.key.height = unit(0.5, 'cm'),
legend.key.width = unit(0.5, 'cm'),
legend.title = element_text(size=6),
legend.text = element_text(size=4))
s8 <- shows %>%
ggplot(aes(x=SCORE, y=NUMBER_OF_VOTES, color=MAIN_GENRE)) +
geom_point() +
theme_hc() +
labs(x = "Score", y = "Number of Votes", title="Shows", color = "Main Genre") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
legend.key.size = unit(0.5, 'cm'),
legend.key.height = unit(0.5, 'cm'),
legend.key.width = unit(0.5, 'cm'),
legend.title = element_text(size=6),
legend.text = element_text(size=4))
m8 + s8
From the plots we see that most movies have fewer than 800,000 votes and most shows have fewer than 500,000, with a few outliers. The observations with a high number of votes all have a high score as well. The genres appear to be evenly scattered across each plot, so neither variable may have a strong relationship with genre.
m9 <- movies %>%
ggplot(aes(y=MAIN_GENRE, x=DURATION)) +
geom_boxplot(fill="#ff9896") +
theme_hc() +
labs(x = "Duration", y = "Main Genre", title = "Movies")
s9 <- shows %>%
ggplot(aes(y=MAIN_GENRE, x=DURATION)) +
geom_boxplot(fill="#c49c94") +
theme_hc() +
labs(x = "Duration", y = "Main Genre", title = "Shows")
m9 + s9
There is considerable variation in duration across genres for both movies and shows. For movies, scifi has the highest median duration, with western, romance, drama, and crime close behind. Drama contains many outliers with longer durations. For shows, crime has the highest median and comedy has the lowest. Because of these clear distinctions, duration might be a good predictor of main genre.
The number of seasons is only present in our shows dataset. Let’s look at how it relates to the other variables.
shows %>%
ggplot(aes(x=NUMBER_OF_SEASONS, y=NUMBER_OF_VOTES, color=MAIN_GENRE)) +
geom_point() +
theme_hc() +
labs(x = "Number of Seasons", y = "Number of Votes", title = "Scatterplot of Number of Votes against Number of Seasons", color = "Main Genre")
shows %>%
ggplot(aes(x=MAIN_GENRE, y=NUMBER_OF_SEASONS)) +
geom_boxplot() +
theme_hc() +
labs(x = "Main Genre", y = "Number of Seasons", title = "Box Plot of Number of Seasons per Genre")
Here we are looking at relationships with the number of seasons, for shows only. There does not appear to be an obvious relationship between the number of seasons and the number of votes, but on closer inspection all the shows with a high number of votes have at least 5 seasons. In the boxplots of the number of seasons for each genre, we see that comedy has the highest median, but action and drama have multiple outliers with the highest numbers of seasons. Documentary is the genre with the fewest seasons, which is unsurprising.
Before building our models, we need to split the data into training and testing sets. We will use an 80/20 split, stratified on the outcome variable, main genre, for both datasets. We will build our models on the training set. Furthermore, we will use k-fold cross-validation with five folds to estimate each model’s test error rate on new data.
set.seed(131) # setting a seed to replicate results
movies_split <- initial_split(movies, prop=0.8, strata=MAIN_GENRE)
movies_train <- training(movies_split)
movies_test <- testing(movies_split)
shows_split <- initial_split(shows, prop=0.8, strata=MAIN_GENRE)
shows_train <- training(shows_split)
shows_test <- testing(shows_split)
nrow(movies_train); nrow(movies_test)
## [1] 261
## [1] 66
nrow(shows_train); nrow(shows_test)
## [1] 180
## [1] 46
For movies, there are 261 observations in the training dataset and 66 in the testing dataset. For shows, there are 180 in training and 46 in testing.
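To illustrate what stratifying on MAIN_GENRE (the strata argument in the initial_split() calls above) aims for, here is a minimal base R sketch. It is not how rsample implements the split; the genre counts are made up, and the point is only that sampling 80% within each genre keeps the genre proportions the same in the training and testing sets.

```r
set.seed(131)

# Hypothetical genre labels (counts are illustrative, not the project data)
genre <- rep(c("drama", "comedy", "thriller"), times = c(50, 30, 20))

# Sample 80% of the row indices separately within each genre
train_idx <- unlist(lapply(
  split(seq_along(genre), genre),
  function(idx) sample(idx, size = round(0.8 * length(idx)))
))

train <- genre[train_idx]
test <- genre[-train_idx]

# Genre proportions match across the full data, training set, and testing set
print(prop.table(table(genre)))
print(prop.table(table(train)))
print(prop.table(table(test)))
```

Without stratification, a rare genre could by chance end up almost entirely in one of the two sets, which matters here because the genres are so unevenly represented.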
Next, we are going to build a recipe for all our models. This recipe is like a general guide of which predictors to use, how to use them, and what to do with them. Each model that we build will be using the same recipe, but will work with it in their own way unique to that specific model. The variables we are using to predict the main genre are release year, score, number of votes, duration, and region for movies, and the same plus number of seasons for shows.
movies_recipe <- movies_train %>%
recipe(MAIN_GENRE ~ RELEASE_YEAR + NUMBER_OF_VOTES + DURATION + SCORE + REGION) %>%
step_naomit() %>%
step_dummy(all_nominal_predictors()) %>%
step_interact(terms = ~ NUMBER_OF_VOTES:SCORE) %>%
step_normalize(NUMBER_OF_VOTES, DURATION, SCORE)
shows_recipe <- shows_train %>%
recipe(MAIN_GENRE ~ RELEASE_YEAR + NUMBER_OF_VOTES + DURATION + NUMBER_OF_SEASONS + SCORE + REGION) %>%
step_dummy(all_nominal_predictors()) %>%
step_interact(terms = ~ NUMBER_OF_SEASONS:NUMBER_OF_VOTES + NUMBER_OF_VOTES:SCORE) %>%
step_normalize(NUMBER_OF_VOTES, NUMBER_OF_SEASONS, DURATION, SCORE)
movies_recipe %>%
prep() %>%
bake(new_data = movies_train) %>%
head() %>%
kable() %>%
kable_styling("striped", full_width = TRUE) %>%
scroll_box(width = "1000px", height = "250px")
| NUMBER_OF_VOTES | DURATION | SCORE | MAIN_GENRE | RELEASE_YEAR_01 | RELEASE_YEAR_02 | RELEASE_YEAR_03 | RELEASE_YEAR_04 | RELEASE_YEAR_05 | RELEASE_YEAR_06 | RELEASE_YEAR_07 | RELEASE_YEAR_08 | RELEASE_YEAR_09 | RELEASE_YEAR_10 | RELEASE_YEAR_11 | RELEASE_YEAR_12 | RELEASE_YEAR_13 | RELEASE_YEAR_14 | RELEASE_YEAR_15 | RELEASE_YEAR_16 | RELEASE_YEAR_17 | RELEASE_YEAR_18 | RELEASE_YEAR_19 | RELEASE_YEAR_20 | RELEASE_YEAR_21 | RELEASE_YEAR_22 | REGION_AsiaOceania | REGION_Europe | REGION_AfricaME | NUMBER_OF_VOTES_x_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9.9347858 | 0.8562228 | 2.972114 | scifi | -0.0314347 | -0.2284779 | 0.0716822 | 0.2189046 | -0.1113334 | -0.2016875 | 0.1496644 | 0.1773300 | -0.1860008 | -0.1453917 | 0.2195311 | 0.1049819 | -0.2491707 | -0.0545830 | 0.2733215 | -0.0085337 | -0.2893212 | 0.0900310 | 0.2918350 | -0.2054929 | 0.3693311 | 0.3646229 | 0 | 1 | 0 | 19960934.4 |
| -0.4651770 | 1.2868088 | 2.744095 | comedy | -0.2514778 | 0.1062688 | 0.1102803 | -0.2622438 | 0.2603488 | -0.1083494 | -0.1062453 | 0.2675052 | -0.2955184 | 0.1825521 | 0.0141874 | -0.2110746 | 0.3410505 | -0.3781936 | 0.3367461 | -0.2527514 | 0.1626832 | -0.0899621 | 0.0423123 | -0.0165073 | -0.0051674 | 0.0002459 | 1 | 0 | 0 | 179176.5 |
| -0.3565408 | -1.3325895 | 2.744095 | comedy | 0.3143473 | 0.2975526 | 0.1929906 | 0.0367141 | -0.1301744 | -0.2708734 | -0.3612340 | -0.3933900 | -0.3741137 | -0.3194662 | -0.2482792 | -0.1767750 | -0.1156103 | -0.0694340 | -0.0381949 | -0.0191462 | -0.0086753 | -0.0035097 | -0.0012441 | -0.0003746 | 0.0000837 | -0.0000374 | 0 | 0 | 0 | 383443.8 |
| -0.4940445 | -2.3014081 | 2.060037 | comedy | 0.1571737 | -0.1009553 | -0.2481307 | -0.1151112 | 0.1490154 | 0.2503273 | 0.0767729 | -0.1834196 | -0.2512957 | -0.0548575 | 0.2034769 | 0.2637044 | 0.0673126 | -0.2004877 | -0.3003033 | -0.1517471 | 0.1244823 | 0.3380159 | 0.3838752 | 0.2900003 | -0.1354521 | 0.0883456 | 0 | 0 | 0 | 120590.4 |
| 1.2245250 | 1.6456305 | 2.060037 | comedy | -0.0628695 | -0.2125376 | 0.1378504 | 0.1670079 | -0.1986872 | -0.0977828 | 0.2371605 | 0.0115947 | -0.2460934 | 0.0830810 | 0.2195311 | -0.1749698 | -0.1533358 | 0.2491226 | 0.0462544 | -0.2852890 | 0.0979241 | 0.2540047 | -0.2648964 | -0.1033544 | -0.4819533 | -0.2290638 | 1 | 0 | 0 | 3240568.8 |
| -0.4785257 | -0.9378857 | 1.832018 | documentary | 0.1257389 | -0.1487763 | -0.2315887 | -0.0115939 | 0.2260924 | 0.1740970 | -0.1025284 | -0.2529145 | -0.0863916 | 0.1918775 | 0.2404388 | -0.0016664 | -0.2500834 | -0.2173991 | 0.0623734 | 0.2897966 | 0.2326439 | -0.0536990 | -0.3232634 | -0.3939626 | 0.2417697 | -0.1843048 | 0 | 1 | 0 | 146993.0 |
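One detail visible in the baked output is that NUMBER_OF_VOTES_x_SCORE is in the hundreds of thousands to millions, while the other numeric predictors are standardized. This is consistent with the recipe: step_interact() runs before step_normalize(), and the interaction column is not listed in step_normalize(). The base R sketch below uses the votes and scores of the first four baked rows, as read from the raw movies table, to show how an unscaled column dominates Euclidean distances, which is relevant for distance-based models such as k-nearest neighbors. The z-scores here are computed on these four rows only, so they differ from the baked values.

```r
# Votes and scores for Inception, Anbe Sivam, Bo Burnham: Inside,
# and Bo Burnham: Make Happy, taken from the raw movies table
votes <- c(2268288, 20595, 44074, 14356)
score <- c(8.8, 8.7, 8.7, 8.4)

features <- data.frame(
  votes_z = as.numeric(scale(votes)),
  score_z = as.numeric(scale(score)),
  votes_x_score = votes * score  # interaction built from the raw values
)

# The interaction column has a standard deviation in the millions
print(sapply(features, sd))
# Pairwise distances are driven almost entirely by that column
print(round(as.matrix(dist(features)), 1))

# Standardizing the interaction puts all three columns on the same scale
features_scaled <- features
features_scaled$votes_x_score <- as.numeric(scale(features$votes_x_score))
print(sapply(features_scaled, sd))
print(round(as.matrix(dist(features_scaled)), 2))
```

If the large scale is not intended, one option might be to include the interaction column in step_normalize(); whether that improves any of the models here is not something this sketch can show.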
shows_recipe %>%
prep() %>%
bake(new_data = shows_train) %>%
head() %>%
kable() %>%
kable_styling("striped", full_width = TRUE) %>%
scroll_box(width = "1000px", height = "250px")
| NUMBER_OF_VOTES | DURATION | NUMBER_OF_SEASONS | SCORE | MAIN_GENRE | RELEASE_YEAR_01 | RELEASE_YEAR_02 | RELEASE_YEAR_03 | RELEASE_YEAR_04 | RELEASE_YEAR_05 | RELEASE_YEAR_06 | RELEASE_YEAR_07 | RELEASE_YEAR_08 | RELEASE_YEAR_09 | RELEASE_YEAR_10 | RELEASE_YEAR_11 | RELEASE_YEAR_12 | RELEASE_YEAR_13 | RELEASE_YEAR_14 | RELEASE_YEAR_15 | RELEASE_YEAR_16 | RELEASE_YEAR_17 | RELEASE_YEAR_18 | RELEASE_YEAR_19 | RELEASE_YEAR_20 | RELEASE_YEAR_21 | RELEASE_YEAR_22 | REGION_Europe | REGION_NSAmerica | REGION_AfricaME | NUMBER_OF_SEASONS_x_NUMBER_OF_VOTES | NUMBER_OF_VOTES_x_SCORE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -0.3293163 | 0.4752719 | -0.8106196 | 2.6532162 | documentary | 0.2514778 | 0.1062688 | -0.1102803 | -0.2622438 | -0.2603488 | -0.1083494 | 0.1062453 | 0.2675052 | 0.2955184 | 0.1825521 | -0.0141874 | -0.2110746 | -0.3410505 | -0.3781936 | -0.3367461 | -0.2527514 | -0.1626832 | -0.0899621 | -0.0423123 | -0.0165073 | 0.0046333 | -0.0023010 | 1 | 0 | 0 | 41386 | 384889.8 |
| 0.4372215 | -0.0950544 | -0.8106196 | 2.2200381 | action | 0.3143473 | 0.2975526 | 0.1929906 | 0.0367141 | -0.1301744 | -0.2708734 | -0.3612340 | -0.3933900 | -0.3741137 | -0.3194662 | -0.2482792 | -0.1767750 | -0.1156103 | -0.0694340 | -0.0381949 | -0.0191462 | -0.0086753 | -0.0035097 | -0.0012441 | -0.0003746 | 0.0000837 | -0.0000374 | 0 | 1 | 0 | 175412 | 1596249.2 |
| 0.9291804 | 0.6653806 | -0.4782471 | 1.1370927 | crime | 0.1886084 | -0.0425075 | -0.2371027 | -0.2062064 | 0.0205539 | 0.2309552 | 0.2315029 | 0.0223611 | -0.2106354 | -0.2666222 | -0.1004317 | 0.1517793 | 0.2957191 | 0.2308505 | 0.0080595 | -0.2319204 | -0.3677303 | -0.3652133 | -0.2697823 | -0.1531004 | 0.0582812 | -0.0339028 | 0 | 1 | 0 | 522858 | 2248289.4 |
| 0.3711405 | -0.7287502 | 0.5188704 | 1.1370927 | action | 0.2200431 | 0.0265672 | -0.1929906 | -0.2636240 | -0.1318872 | 0.1012211 | 0.2628173 | 0.2376904 | 0.0460817 | -0.1807506 | -0.2975617 | -0.2381533 | -0.0434299 | 0.1810493 | 0.3360453 | 0.3781316 | 0.3247118 | 0.2256041 | 0.1284866 | 0.0591874 | -0.0190988 | 0.0101736 | 0 | 1 | 0 | 819290 | 1409178.8 |
| -0.0793705 | 0.4752719 | -0.4782471 | 1.1370927 | action | 0.2200431 | 0.0265672 | -0.1929906 | -0.2636240 | -0.1318872 | 0.1012211 | 0.2628173 | 0.2376904 | 0.0460817 | -0.1807506 | -0.2975617 | -0.2381533 | -0.0434299 | 0.1810493 | 0.3360453 | 0.3781316 | 0.3247118 | 0.2256041 | 0.1284866 | 0.0591874 | -0.0190988 | 0.0101736 | 0 | 0 | 0 | 170176 | 731756.8 |
| 0.1573236 | 0.7921198 | 0.5188704 | 0.9205036 | action | 0.1257389 | -0.1487763 | -0.2315887 | -0.0115939 | 0.2260924 | 0.1740970 | -0.1025284 | -0.2529145 | -0.0863916 | 0.1918775 | 0.2404388 | -0.0016664 | -0.2500834 | -0.2173991 | 0.0623734 | 0.2897966 | 0.2326439 | -0.0536990 | -0.3232634 | -0.3939626 | 0.2417697 | -0.1843048 | 1 | 0 | 0 | 632365 | 1075020.5 |
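The RELEASE_YEAR_01 to RELEASE_YEAR_22 columns in the baked output hold fractional values rather than 0/1 indicators. This is because RELEASE_YEAR was stored as an ordered factor, and step_dummy() by default encodes ordered factors with polynomial contrasts (R's contr.poly) instead of indicator columns. The base R sketch below shows the same encoding on a made-up four-level ordered factor; the `years` vector is hypothetical.

```r
# A small ordered factor standing in for RELEASE_YEAR (four levels only)
years <- factor(c(2000, 2001, 2002, 2003), ordered = TRUE)

# Polynomial contrasts: one column per polynomial degree (linear, quadratic, cubic)
poly <- contr.poly(nlevels(years))
rownames(poly) <- levels(years)
print(round(poly, 4))

# model.matrix() applies these contrasts to ordered factors by default
print(round(model.matrix(~ years), 4))

# An unordered factor gives the familiar 0/1 indicator columns instead
indicator <- model.matrix(~ factor(as.character(years)))
print(indicator)
```

With 23 release years (2000 to 2022) this yields 22 polynomial columns, which matches the baked tables. If one indicator column per year is preferred, RELEASE_YEAR could be kept as an unordered factor, or as a plain numeric variable if a smooth trend over time is expected.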
We will split our training data into five folds using k-fold cross-validation. This technique estimates the test error rate from the available training data: the observations are divided into k groups, or folds, of roughly equal size. Each fold is then treated in turn as the validation set while the model is fit on the remaining folds, until every fold has served as the validation set once. Using k-fold cross-validation rather than simply evaluating our models on the entire training set helps avoid overfitting to the training data, and averaging over the folds reduces the variance of the performance estimate.
movies_folds <- vfold_cv(movies_train, v=5)
shows_folds <- vfold_cv(shows_train, v=5)
movies_folds; shows_folds
## # 5-fold cross-validation
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [208/53]> Fold1
## 2 <split [209/52]> Fold2
## 3 <split [209/52]> Fold3
## 4 <split [209/52]> Fold4
## 5 <split [209/52]> Fold5
## # 5-fold cross-validation
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [144/36]> Fold1
## 2 <split [144/36]> Fold2
## 3 <split [144/36]> Fold3
## 4 <split [144/36]> Fold4
## 5 <split [144/36]> Fold5
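The fold sizes printed above can be reproduced with a few lines of base R. This is only a sketch of the idea behind k-fold assignment, not rsample's implementation, and the fold memberships will differ from those produced by vfold_cv(); `fold_id` is a hypothetical name.

```r
set.seed(131)

n <- 261  # number of observations in the movies training set
k <- 5    # number of folds

# Assign each observation to one of k folds of (nearly) equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- table(fold_id)
print(fold_sizes)

# Each fold serves once as the assessment (validation) set;
# the remaining k - 1 folds form the analysis set used for fitting
for (i in seq_len(k)) {
  assessment <- which(fold_id == i)
  analysis <- which(fold_id != i)
  cat(sprintf("Fold%d: analysis = %d, assessment = %d\n",
              i, length(analysis), length(assessment)))
}
```

Because 261 is not divisible by 5, one fold holds 53 observations and the other four hold 52, which is the same 208/53 and 209/52 pattern shown in the output above.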
1. Set up the model by specifying the type of model and its parameters, and setting up the engine and mode.
We will be building a total of five models for both movies and shows:
K-nearest neighbors (tuning number of neighbors)
Elastic net multinomial regression (tuning mixture and penalty)
Pruned decision trees (tuning cost complexity)
Random forest (tuning the number of predictors, number of trees, and minimum number of observations in a node)
Gradient-boosted trees (tuning the number of predictors, number of trees, and the learning rate)
For each of the models, the mode is set as classification, as that is our goal.
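For the elastic net model, the two tuned parameters control the regularization term added to the loss. With the glmnet engine, penalty is the overall strength (lambda) and mixture is the balance between lasso and ridge (alpha). The sketch below writes out that penalty term for a single coefficient vector; the function name and the example coefficients are hypothetical.

```r
# Elastic net penalty in glmnet's parameterization:
#   penalty * ((1 - mixture) / 2 * sum(beta^2) + mixture * sum(abs(beta)))
elastic_net_penalty <- function(beta, penalty, mixture) {
  ridge_part <- (1 - mixture) / 2 * sum(beta^2)
  lasso_part <- mixture * sum(abs(beta))
  penalty * (ridge_part + lasso_part)
}

beta <- c(0.5, -2, 0)  # illustrative coefficients

lasso <- elastic_net_penalty(beta, penalty = 0.1, mixture = 1)    # pure lasso
ridge <- elastic_net_penalty(beta, penalty = 0.1, mixture = 0)    # pure ridge
mixed <- elastic_net_penalty(beta, penalty = 0.1, mixture = 0.5)  # equal blend
print(c(lasso = lasso, ridge = ridge, mixed = mixed))
```

A mixture of 1 shrinks some coefficients exactly to zero, while a mixture of 0 only shrinks them toward zero, so tuning both parameters lets cross-validation choose how sparse the fitted model should be.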
# knn
movies_knn <- nearest_neighbor(neighbors=tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
shows_knn <- nearest_neighbor(neighbors=tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
# elastic net multinomial regression
movies_en <- multinom_reg(mixture = tune(), penalty = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
shows_en <- multinom_reg(mixture = tune(), penalty = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
# pruned decision trees
movies_tree <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
shows_tree <- decision_tree(cost_complexity = tune()) %>%
set_engine("rpart") %>%
set_mode("classification")
# random forest
movies_forest <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
shows_forest <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
# gradient-boosted trees
movies_bt <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
shows_bt <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
2. Set up the workflow using the workflow() function and add the established model and recipe.
# knn
movies_knn_workflow <- workflow() %>%
add_model(movies_knn) %>%
add_recipe(movies_recipe)
shows_knn_workflow <- workflow() %>%
add_model(shows_knn) %>%
add_recipe(shows_recipe)
# elastic net multinomial regression
movies_en_workflow <- workflow() %>%
add_model(movies_en) %>%
add_recipe(movies_recipe)
shows_en_workflow <- workflow() %>%
add_model(shows_en) %>%
add_recipe(shows_recipe)
# pruned decision trees
movies_tree_workflow <- workflow() %>%
add_model(movies_tree) %>%
add_recipe(movies_recipe)
shows_tree_workflow <- workflow() %>%
add_model(shows_tree) %>%
add_recipe(shows_recipe)
# random forest
movies_forest_workflow <- workflow() %>%
add_model(movies_forest) %>%
add_recipe(movies_recipe)
shows_forest_workflow <- workflow() %>%
add_model(shows_forest) %>%
add_recipe(shows_recipe)
# gradient-boosted trees
movies_bt_workflow <- workflow() %>%
add_model(movies_bt) %>%
add_recipe(movies_recipe)
shows_bt_workflow <- workflow() %>%
add_model(shows_bt) %>%
add_recipe(shows_recipe)
3. Set up tuning grids for the parameters we want tuned, and specify the ranges as well as the number of levels we want.
# knn
movies_knn_grid <- grid_regular(neighbors(range=c(1, 10)), levels=10)
shows_knn_grid <- movies_knn_grid
# en
movies_en_grid <- grid_regular(penalty(range=c(0,3),
trans = identity_trans()),
mixture(range=c(0, 1)), levels=10)
shows_en_grid <- movies_en_grid
# pruned decision trees
movies_tree_grid <- grid_regular(cost_complexity(range = c(-3, -1)), levels = 10)
shows_tree_grid <- movies_tree_grid
# random forest
movies_forest_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
shows_forest_grid <- movies_forest_grid
# gradient-boosted trees
movies_bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
shows_bt_grid <- movies_bt_grid
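Several of these ranges are given on a log10 scale, which is easy to miss when reading the grids. The base R sketch below reproduces the candidate values for `cost_complexity()` and `learn_rate()` under the assumption that a regular grid spaces the exponents evenly, and counts how many model fits each grid implies. The variable names are made up for the illustration, and the grid sizes assume every integer parameter yields five distinct values.

```r
# cost_complexity(range = c(-3, -1)), levels = 10: exponents are evenly spaced.
cost_complexity_values <- 10^seq(-3, -1, length.out = 10)

# learn_rate(range = c(-10, -1)), levels = 5: also a log10 range.
learn_rate_values <- 10^seq(-10, -1, length.out = 5)

# penalty() was given identity_trans(), so c(0, 3) is already on the natural scale.
penalty_values <- seq(0, 3, length.out = 10)

print(signif(cost_complexity_values, 3))
print(signif(learn_rate_values, 3))

# Candidate combinations per grid, and total fits per dataset with 5 folds.
grid_sizes <- c(knn = 10, elastic_net = 10 * 10, tree = 10,
                forest = 5^3, boosted = 5^3)
fits_per_dataset <- grid_sizes * 5
print(fits_per_dataset)
```

The random forest and boosted tree grids each require 625 fits per dataset, which is why tuning them takes much longer than tuning the other models.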
4. Tune each of the models using the workflow as the object, the folds as the resamples, and the grids we created.
# knn
movies_knn_tune <- tune_grid(
object = movies_knn_workflow,
resamples = movies_folds,
grid = movies_knn_grid
)
shows_knn_tune <- tune_grid(
object = shows_knn_workflow,
resamples = shows_folds,
grid = shows_knn_grid
)
# elastic net multinomial regression
movies_en_tune <- tune_grid(
object = movies_en_workflow,
resamples = movies_folds,
grid = movies_en_grid
)
shows_en_tune <- tune_grid(
object = shows_en_workflow,
resamples = shows_folds,
grid = shows_en_grid
)
# pruned decision tree
movies_tree_tune <- tune_grid(
object = movies_tree_workflow,
resamples = movies_folds,
grid = movies_tree_grid
)
shows_tree_tune <- tune_grid(
object = shows_tree_workflow,
resamples = shows_folds,
grid = shows_tree_grid
)
# random forest
movies_forest_tune <- tune_grid(
object = movies_forest_workflow,
resamples = movies_folds,
grid = movies_forest_grid
)
shows_forest_tune <- tune_grid(
object = shows_forest_workflow,
resamples = shows_folds,
grid = shows_forest_grid
)
# gradient-boosted trees
movies_bt_tune <- tune_grid(
object = movies_bt_workflow,
resamples = movies_folds,
grid = movies_bt_grid
)
shows_bt_tune <- tune_grid(
object = shows_bt_workflow,
resamples = shows_folds,
grid = shows_bt_grid
)
Because tuning each of the models takes a long time, we will save the results after running them into RDA files so that we don’t have to rerun them every time.
save(movies_knn_tune, file="movies_knn_results.rda")
save(movies_en_tune, file="movies_en_results.rda")
save(movies_tree_tune, file="movies_tree_results.rda")
save(movies_forest_tune, file="movies_forest_results.rda")
save(movies_bt_tune, file="movies_bt_results.rda")
save(shows_knn_tune, file="shows_knn_results.rda")
save(shows_en_tune, file="shows_en_results.rda")
save(shows_tree_tune, file="shows_tree_results.rda")
save(shows_forest_tune, file="shows_forest_results.rda")
save(shows_bt_tune, file="shows_bt_results.rda")
5. Load the saved results back in to use for our analysis.
load(file="movies_knn_results.rda")
load(file="movies_en_results.rda")
load(file="movies_tree_results.rda")
load(file="movies_forest_results.rda")
load(file="movies_bt_results.rda")
load(file="shows_knn_results.rda")
load(file="shows_en_results.rda")
load(file="shows_tree_results.rda")
load(file="shows_forest_results.rda")
load(file="shows_bt_results.rda")
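The save-then-load pattern above can also be wrapped in a small helper so that the expensive step runs only when no saved result exists. This is a sketch under assumptions: `cache_rds()` is a hypothetical helper, not part of tidymodels, it uses `saveRDS()`/`readRDS()` rather than the `save()`/`load()` calls used in this project, and the demonstration uses a toy function with made-up values in place of a real `tune_grid()` call.

```r
# Hypothetical helper: return the cached object if `path` exists;
# otherwise run `compute()`, save its result to `path`, and return it.
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Toy demonstration with a temporary file and a stand-in for a slow tuning call.
cache_path <- tempfile(fileext = ".rds")
n_runs <- 0
slow_step <- function() {
  n_runs <<- n_runs + 1  # count how often the expensive step actually runs
  data.frame(neighbors = 1:3, mean = c(0.5, 0.6, 0.55))  # made-up values
}

first <- cache_rds(cache_path, slow_step)   # computes and writes the file
second <- cache_rds(cache_path, slow_step)  # reads from disk; slow_step() is not rerun
print(n_runs)
```

In this project, `compute` would be a function wrapping one of the `tune_grid()` calls from step 4.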
6. Collect metrics of the tuned models.
movies_knn_metrics <- collect_metrics(movies_knn_tune)
movies_en_metrics <- collect_metrics(movies_en_tune)
movies_tree_metrics <- collect_metrics(movies_tree_tune)
movies_forest_metrics <- collect_metrics(movies_forest_tune)
movies_bt_metrics <- collect_metrics(movies_bt_tune)
shows_knn_metrics <- collect_metrics(shows_knn_tune)
shows_en_metrics <- collect_metrics(shows_en_tune)
shows_tree_metrics <- collect_metrics(shows_tree_tune)
shows_forest_metrics <- collect_metrics(shows_forest_tune)
shows_bt_metrics <- collect_metrics(shows_bt_tune)
We have collected metrics from our model results, so now it is finally time to compare them and see which model was the best fit for our dataset. The performance is measured by the area under the ROC curve (ROC AUC), which measures the overall performance of our classifiers. A higher AUC means a better performance.
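As a rough illustration of what this metric measures, the base R sketch below computes the AUC for a two-class problem as the share of positive/negative pairs that the classifier ranks correctly. The `hand_till` estimator reported in our results extends this idea to several classes by averaging AUCs over pairs of classes; the sketch covers only the binary building block, and `binary_auc()` and the data are made up for the example.

```r
# Binary AUC from the rank-sum (Mann-Whitney) identity.
# `truth` is 1 for the positive class and 0 otherwise;
# `score` is the predicted probability of the positive class.
binary_auc <- function(truth, score) {
  r <- rank(score)  # average ranks, so tied scores count as half
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores for six observations.
truth <- c(1, 1, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)

# 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9.
print(binary_auc(truth, score))
```

A value of 0.5 corresponds to random ordering and 1 to perfect separation, which is the scale to keep in mind when reading the results below.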
Let’s look at the plotted results from our models. The autoplot function in R allows us to visualize the result of each tuned parameter in our models.
autoplot(movies_knn_tune)
autoplot(shows_knn_tune)
We tuned our k-nearest neighbors models at 10 levels from 1 to 10 neighbors. For movies, the highest ROC AUC was 0.538 with k = 3. For shows, the highest ROC AUC was 0.653 with k = 2.
autoplot(movies_en_tune)
autoplot(shows_en_tune)
We tuned our elastic net models at 10 levels each of penalty and mixture. For movies, the best ROC AUC was 0.681 with penalty = 0 and mixture = 0. For shows, the best ROC AUC was 0.661, also with penalty = 0 and mixture = 0. For both datasets, these are the highest ROC AUC values of any model we tuned.
autoplot(movies_tree_tune)
autoplot(shows_tree_tune)
For our decision tree model, we tuned 10 levels of cost complexity from -3 to -1 on the log10 scale (0.001 to 0.1). Our best ROC AUC for movies was 0.6 with a cost complexity of 0.0359. This did better than our knn model, but not as well as the elastic net model. The highest ROC AUC for shows was 0.611 with a cost complexity of 0.0599. This did not do as well as the other two models.
autoplot(movies_forest_tune)
autoplot(shows_forest_tune)
For our random forest model, we tuned 5 levels each of the number of predictors from 1 to 6, the number of trees from 200 to 600, and the minimum number of data points per node from 10 to 20. The highest ROC AUC was 0.63 for movies with mtry = 6, trees = 400, and min_n = 17, and 0.637 for shows with mtry = 4, trees = 200, and min_n = 20. Both are among the better results so far, behind only the elastic net for movies and behind the elastic net and k-nearest neighbors for shows, so they may be worth a closer look.
autoplot(movies_bt_tune)
autoplot(shows_bt_tune)
For our boosted trees model, we tuned 5 levels each of the number of predictors from 1 to 6, the number of trees from 200 to 600, and the learning rate from -10 to -1 on the log10 scale. The highest ROC AUC for movies was 0.625 with mtry = 6, trees = 300, and learn_rate = 0.1, just behind the random forest model. For shows it was 0.629 with mtry = 2, trees = 600, and learn_rate = 0.1.
Here is a visualization of the highest ROC AUC produced by each of our models.
movies_knn_highest <- bind_cols(arrange(movies_knn_metrics[movies_knn_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "K-nearest neighbors")
movies_en_highest <- bind_cols(arrange(movies_en_metrics[movies_en_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Elastic Net")
movies_tree_highest <- bind_cols(arrange(movies_tree_metrics[movies_tree_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Pruned Decision Tree")
movies_forest_highest <- bind_cols(arrange(movies_forest_metrics[movies_forest_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Random Forest")
movies_bt_highest <- bind_cols(arrange(movies_bt_metrics[movies_bt_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Boosted Decision Tree")
shows_knn_highest <- bind_cols(arrange(shows_knn_metrics[shows_knn_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "K-nearest neighbors")
shows_en_highest <- bind_cols(arrange(shows_en_metrics[shows_en_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Elastic Net")
shows_tree_highest <- bind_cols(arrange(shows_tree_metrics[shows_tree_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Pruned Decision Tree")
shows_forest_highest <- bind_cols(arrange(shows_forest_metrics[shows_forest_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Random Forest")
shows_bt_highest <- bind_cols(arrange(shows_bt_metrics[shows_bt_metrics$.metric=="roc_auc","mean"], desc(mean))[1,], "Boosted Decision Tree")
movies_results <- bind_rows(movies_knn_highest, movies_en_highest, movies_tree_highest, movies_forest_highest, movies_bt_highest)
colnames(movies_results) <- c("ROC_AUC", "Model")
shows_results <- bind_rows(shows_knn_highest, shows_en_highest, shows_tree_highest, shows_forest_highest, shows_bt_highest)
colnames(shows_results) <- c("ROC_AUC", "Model")
movies_results %>%
ggplot(aes(x=Model, y=ROC_AUC)) +
geom_col(fill="#ff9896") +
geom_text(aes(label = round(ROC_AUC, 3)), vjust = -0.5) +
ylim(0, 1) +
theme_hc() +
labs(y = "ROC AUC", title = "Comparing ROC AUC for Movies")
shows_results %>%
ggplot(aes(x=Model, y=ROC_AUC)) +
geom_col(fill="#c49c94") +
geom_text(aes(label = round(ROC_AUC, 3)), vjust = -0.5) +
ylim(0, 1) +
theme_hc() +
labs(y = "ROC AUC", title = "Comparing ROC AUC for Shows")
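The ten near-identical extraction lines used to build these comparison tables could be condensed by looping over a named list of metrics tables. The sketch below shows the idea in base R on two toy data frames; the `toy_*` objects and their values are made up and only mimic the `.metric` and `mean` columns returned by `collect_metrics()`.

```r
# Hypothetical stand-ins for two collect_metrics() outputs (values are made up).
toy_knn_metrics <- data.frame(
  .metric = c("accuracy", "roc_auc", "accuracy", "roc_auc"),
  mean    = c(0.30, 0.52, 0.28, 0.54)
)
toy_tree_metrics <- data.frame(
  .metric = c("accuracy", "roc_auc", "accuracy", "roc_auc"),
  mean    = c(0.33, 0.58, 0.35, 0.60)
)

# Name each metrics table by the label it should carry in the plot.
metrics_by_model <- list(
  "K-nearest neighbors"  = toy_knn_metrics,
  "Pruned Decision Tree" = toy_tree_metrics
)

# For each model, keep the largest mean ROC AUC across its tuning grid.
best_roc_auc <- vapply(
  metrics_by_model,
  function(m) max(m$mean[m$.metric == "roc_auc"]),
  numeric(1)
)

results <- data.frame(Model = names(best_roc_auc), ROC_AUC = unname(best_roc_auc))
print(results)
```

With the real objects, the list would hold the five `*_metrics` tables for one dataset, and `results` could be passed to the same `ggplot()` code.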
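Once again, the elastic net model resulted in the highest ROC AUC value for both the movies and shows datasets. For both, the selected penalty and mixture happen to be 0, so the chosen model is effectively an unpenalized multinomial regression. This is the model we will evaluate on our testing datasets. Note that the movies estimate in the table below has n = 2, meaning it is averaged over only two of the five folds, so it is less reliable than the shows estimate.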
show_best(movies_en_tune, metric="roc_auc")[1,] %>% kable()
| penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 0 | 0 | roc_auc | hand_till | 0.6805623 | 2 | 0.0095567 | Preprocessor1_Model001 |
show_best(shows_en_tune, metric="roc_auc")[1,] %>% kable()
| penalty | mixture | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 0 | 0 | roc_auc | hand_till | 0.661077 | 5 | 0.0275367 | Preprocessor1_Model001 |
Before evaluating the model on our testing sets, we will finalize the elastic net workflow using the best parameters, then fit it to our entire training dataset.
movies_best <- select_best(movies_en_tune, metric="roc_auc")
movies_final_workflow <- finalize_workflow(movies_en_workflow, movies_best)
movies_final_fit <- fit(movies_final_workflow, movies_train)
shows_best <- select_best(shows_en_tune, metric="roc_auc")
shows_final_workflow <- finalize_workflow(shows_en_workflow, shows_best)
shows_final_fit <- fit(shows_final_workflow, shows_train)
And finally, we can use it to predict on our testing sets and look at how it performs on new data.
movies_final_test <- augment(movies_final_fit, new_data=movies_test) %>%
dplyr::select(MAIN_GENRE, starts_with(".pred"))
movies_final_test$MAIN_GENRE <- factor(movies_final_test$MAIN_GENRE)
roc_auc(movies_final_test, truth=MAIN_GENRE,
.pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_fantasy:.pred_horror:.pred_romance:.pred_scifi:.pred_thriller) %>%
kable()
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | hand_till | 0.515081 |
shows_final_test <- augment(shows_final_fit, new_data=shows_test) %>%
dplyr::select(MAIN_GENRE, starts_with(".pred"))
shows_final_test$MAIN_GENRE <- factor(shows_final_test$MAIN_GENRE)
roc_auc(shows_final_test, truth=MAIN_GENRE,
.pred_action:.pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_scifi:.pred_war) %>%
kable()
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | hand_till | 0.7508291 |
In the output above, the ROC AUC of our model was 0.515 for movies and 0.751 for shows. Evidently, our model did not do well on our movies dataset: a value this close to 0.5 is barely better than random guessing, and the drop from the cross-validated 0.681 suggests it overfitted to our training set. On the other hand, our model did well for shows. We can say that our model predicts the genre of shows better than the genre of movies.
Because the cross-validated ROC AUC values on our training dataset were quite similar across models, it might be worth evaluating some of the other models on our testing dataset for movies to see if they can produce a higher ROC AUC. In particular, we are interested in the random forest and boosted decision tree models. We will use the same steps as before: finalize the workflow, fit it to the training data, and evaluate it on the testing data.
movies_best_forest <- select_best(movies_forest_tune, metric="roc_auc")
movies_final_workflow_forest <- finalize_workflow(movies_forest_workflow, movies_best_forest)
movies_final_fit_forest <- fit(movies_final_workflow_forest, movies_train)
movies_final_test_forest <- augment(movies_final_fit_forest, new_data=movies_test) %>%
dplyr::select(MAIN_GENRE, starts_with(".pred"))
movies_final_test_forest$MAIN_GENRE <- factor(movies_final_test_forest$MAIN_GENRE)
roc_auc(movies_final_test_forest, truth=MAIN_GENRE,
.pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_fantasy:.pred_horror:.pred_romance:.pred_scifi:.pred_thriller) %>%
kable()
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | hand_till | 0.5654177 |
movies_best_bt <- select_best(movies_bt_tune, metric="roc_auc")
movies_final_workflow_bt <- finalize_workflow(movies_bt_workflow, movies_best_bt)
movies_final_fit_bt <- fit(movies_final_workflow_bt, movies_train)
movies_final_test_bt <- augment(movies_final_fit_bt, new_data=movies_test) %>%
dplyr::select(MAIN_GENRE, starts_with(".pred"))
movies_final_test_bt$MAIN_GENRE <- factor(movies_final_test_bt$MAIN_GENRE)
roc_auc(movies_final_test_bt, truth=MAIN_GENRE,
.pred_comedy:.pred_crime:.pred_documentary:.pred_drama:.pred_fantasy:.pred_horror:.pred_romance:.pred_scifi:.pred_thriller) %>%
kable()
| .metric | .estimator | .estimate |
|---|---|---|
| roc_auc | hand_till | 0.5011232 |
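Interestingly, the boosted trees model fit the movies testing dataset better in our analysis, with an ROC AUC of 0.603. This is still not as high as the ROC AUC for shows, but it is an improvement from the elastic net model. The output shown above gives different values (0.565 for the random forest and 0.501 for boosted trees), which suggests these results change from run to run when no random seed is fixed, so the ranking of the two tree-based models should be read with caution.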
The elastic net model likely produced a lower ROC AUC on our testing dataset because it overfitted to the training data. We also did not have many predictors in our movies dataset, which may have limited how well the elastic net model could predict.
Let’s take a look at the variable importance graph using the vip function. This tells us which predictors were the most important in determining the genre of a movie or show. For movies, the duration, number of votes, score, and the interaction between the number of votes and score were the most important. For shows, the region and release year take up the top spots of the chart.
movies_final_fit_bt %>%
extract_fit_parsnip() %>%
vip()
shows_final_fit %>%
extract_fit_parsnip() %>%
vip()
Through this project, we learned that the best predictors for a movie’s genre are the duration, number of votes, and score of the movie on IMDb, and the best predictors for a show’s genre are the region and release year. Surprisingly, the predictors are very different. After fitting multiple models to both our datasets, we come to the conclusion that the best model for movies is the boosted trees model and the best for shows is the elastic net model. However, both models, especially in predicting movies, have much room for improvement.
One of our main issues was that we did not have many predictor variables to start with. Given more, our models might have turned out a lot more accurate. It might also be worth looking into other models such as naive Bayes and support vector machines. Having a larger dataset with more observations would also help. We were trying to predict a factor with many levels, and some of these levels contained only a few observations. If there were more data for each main genre, our models would be better trained.
This dataset was taken from the Kaggle dataset “Netflix TV Shows and Movies”.